2025-12-04T09:15:48.8998234Z Current runner version: '2.330.0' 2025-12-04T09:15:48.9004683Z Runner name: 'i-0513695dee1ce902e' 2025-12-04T09:15:48.9005429Z Runner group name: 'default' 2025-12-04T09:15:48.9006250Z Machine name: 'ip-10-0-37-220' 2025-12-04T09:15:48.9008959Z ##[group]GITHUB_TOKEN Permissions 2025-12-04T09:15:48.9011238Z Contents: read 2025-12-04T09:15:48.9011740Z Metadata: read 2025-12-04T09:15:48.9012331Z ##[endgroup] 2025-12-04T09:15:48.9014231Z Secret source: Actions 2025-12-04T09:15:48.9014873Z Prepare workflow directory 2025-12-04T09:15:48.9516162Z Prepare all required actions 2025-12-04T09:15:48.9552115Z Getting action download info 2025-12-04T09:15:49.3237526Z Download action repository 'pytorch/test-infra@main' (SHA:39aa74d619174326f4e2fb0e216151c2f29d9ffd) 2025-12-04T09:15:51.7733936Z Download action repository 'pytorch/pytorch@main' (SHA:7716da9fb23f27a65b41f9f016a2afadf281c18f) 2025-12-04T09:16:07.7922621Z Download action repository 'actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065' (SHA:a26af69be951a213d495a4c3e4e4022e16d87065) 2025-12-04T09:16:08.1244660Z Download action repository 'aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722' (SHA:ececac1a45f3b08a01d2dd070d28d111c5fe6722) 2025-12-04T09:16:08.4026945Z Download action repository 'aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076' (SHA:062b18b96a7aff071d4dc91bc00c4c1a7945b076) 2025-12-04T09:16:08.5656852Z Download action repository 'seemethere/download-artifact-s3@1da556a7aa0a088e3153970611f6c432d58e80e6' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T09:16:08.8370028Z Download action repository 'seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T09:16:09.1209614Z Getting action download info 2025-12-04T09:16:09.2880113Z Download action repository 'actions/checkout@v4' (SHA:34e114876b0b11c390a56381ad16ebd13914f8d5) 
2025-12-04T09:16:09.8916236Z Getting action download info 2025-12-04T09:16:10.0920976Z Download action repository 'nick-fields/retry@v3.0.0' (SHA:7152eba30c6575329ac0576536151aca5a72780e) 2025-12-04T09:16:10.2865158Z Getting action download info 2025-12-04T09:16:10.4189916Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2025-12-04T09:16:10.6326946Z Getting action download info 2025-12-04T09:16:10.8003192Z Uses: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/heads/main (ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32) 2025-12-04T09:16:10.8006894Z ##[group] Inputs 2025-12-04T09:16:10.8007311Z build-environment: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:16:10.8015639Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, 
{"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:16:10.8024395Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:16:10.8025190Z sync-tag: 2025-12-04T09:16:10.8025902Z timeout-minutes: 300 2025-12-04T09:16:10.8026168Z use-gha: 2025-12-04T09:16:10.8026397Z dashboard-tag: 2025-12-04T09:16:10.8026652Z s3-bucket: gha-artifacts 2025-12-04T09:16:10.8026938Z 
aws-role-to-assume: 2025-12-04T09:16:10.8027459Z disable-monitor: false 2025-12-04T09:16:10.8027761Z monitor-log-interval: 5 2025-12-04T09:16:10.8028070Z monitor-data-collect-interval: 1 2025-12-04T09:16:10.8028377Z ##[endgroup] 2025-12-04T09:16:10.8029063Z Complete job name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:16:10.8722823Z A job started hook has been configured by the self-hosted runner administrator 2025-12-04T09:16:10.8835000Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh' 2025-12-04T09:16:10.8846447Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:16:10.8847019Z ##[endgroup] 2025-12-04T09:16:12.5123898Z Runner Type: linux.g5.4xlarge.nvidia.gpu 2025-12-04T09:16:12.5124325Z Instance Type: g5.4xlarge 2025-12-04T09:16:12.5124589Z AMI Name: unknown 2025-12-04T09:16:12.5174851Z AMI ID: ami-08982f1c5bf93d976 2025-12-04T09:16:18.0100664Z ##[group]Run pytorch/test-infra/.github/actions/setup-ssh@main 2025-12-04T09:16:18.0101111Z with: 2025-12-04T09:16:18.0101628Z github-secret: *** 2025-12-04T09:16:18.0102294Z instructions: All testing is done inside the container, to start an interactive session run: docker exec -it $(docker container ps --format '{{.ID}}') bash 2025-12-04T09:16:18.0103035Z activate-with-label: false 2025-12-04T09:16:18.0103312Z label: with-ssh 2025-12-04T09:16:18.0103555Z remove-existing-keys: true 2025-12-04T09:16:18.0103837Z fail-silently: true 2025-12-04T09:16:18.0104110Z env: 2025-12-04T09:16:18.0104350Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:16:18.0104625Z ##[endgroup] 2025-12-04T09:16:18.1521041Z Please see https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions for more info. 
2025-12-04T09:16:18.1522508Z Not on pull request and ciflow reference could not be extracted, skipping adding ssh keys 2025-12-04T09:16:18.1703034Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main 2025-12-04T09:16:18.1703456Z with: 2025-12-04T09:16:18.1703680Z no-sudo: true 2025-12-04T09:16:18.1703934Z submodules: recursive 2025-12-04T09:16:18.1704244Z fetch-depth: 0 2025-12-04T09:16:18.1704668Z env: 2025-12-04T09:16:18.1704888Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:16:18.1705150Z ##[endgroup] 2025-12-04T09:16:18.1775273Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:16:18.1776173Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:16:18.1790222Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:16:18.1790595Z env: 2025-12-04T09:16:18.1790833Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:16:18.1791132Z ##[endgroup] 2025-12-04T09:16:18.1894720Z ##[group]Run # Use all available CPUs for fetching 2025-12-04T09:16:18.1895150Z # Use all available CPUs for fetching 2025-12-04T09:16:18.1895492Z cd "${GITHUB_WORKSPACE}" 2025-12-04T09:16:18.1895823Z git config --global fetch.parallel 0 2025-12-04T09:16:18.1896210Z git config --global submodule.fetchJobs 0 2025-12-04T09:16:18.1896566Z  2025-12-04T09:16:18.1896914Z # Clean workspace. 
The default checkout action should also do this, but 2025-12-04T09:16:18.1897387Z # do it here as well just in case 2025-12-04T09:16:18.1897720Z if [[ -d .git ]]; then 2025-12-04T09:16:18.1898015Z  if [ -z "${NO_SUDO}" ]; then 2025-12-04T09:16:18.1898329Z  sudo git clean -ffdx 2025-12-04T09:16:18.1898606Z  else 2025-12-04T09:16:18.1898837Z  git clean -ffdx 2025-12-04T09:16:18.1899098Z  fi 2025-12-04T09:16:18.1899316Z fi 2025-12-04T09:16:18.1908673Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:16:18.1909057Z env: 2025-12-04T09:16:18.1909345Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:16:18.1909633Z NO_SUDO: true 2025-12-04T09:16:18.1909850Z ##[endgroup] 2025-12-04T09:16:18.2050555Z ##[group]Run actions/checkout@v4 2025-12-04T09:16:18.2050867Z with: 2025-12-04T09:16:18.2051134Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:16:18.2051476Z fetch-depth: 0 2025-12-04T09:16:18.2051725Z submodules: recursive 2025-12-04T09:16:18.2052001Z show-progress: false 2025-12-04T09:16:18.2052283Z repository: pytorch/pytorch 2025-12-04T09:16:18.2052670Z token: *** 2025-12-04T09:16:18.2052900Z ssh-strict: true 2025-12-04T09:16:18.2053142Z ssh-user: git 2025-12-04T09:16:18.2053397Z persist-credentials: true 2025-12-04T09:16:18.2053672Z clean: true 2025-12-04T09:16:18.2053948Z sparse-checkout-cone-mode: true 2025-12-04T09:16:18.2054287Z fetch-tags: false 2025-12-04T09:16:18.2054527Z lfs: false 2025-12-04T09:16:18.2054767Z set-safe-directory: true 2025-12-04T09:16:18.2055048Z env: 2025-12-04T09:16:18.2055263Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:16:18.2055525Z ##[endgroup] 2025-12-04T09:16:18.3200690Z Syncing repository: pytorch/pytorch 2025-12-04T09:16:18.3201951Z ##[group]Getting Git version info 2025-12-04T09:16:18.3202410Z Working directory is '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2025-12-04T09:16:18.3203036Z [command]/usr/bin/git version 2025-12-04T09:16:18.3409420Z git version 2.50.1 2025-12-04T09:16:18.3454647Z ##[endgroup] 
2025-12-04T09:16:18.3466261Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/337fc91e-637c-4da5-8191-cc804d6fde18/.gitconfig' 2025-12-04T09:16:18.3489441Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/337fc91e-637c-4da5-8191-cc804d6fde18' before making global git config changes 2025-12-04T09:16:18.3490366Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T09:16:18.3495144Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:16:18.3550130Z Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2025-12-04T09:16:18.3553742Z ##[group]Initializing the repository 2025-12-04T09:16:18.3558609Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:16:18.3636544Z hint: Using 'master' as the name for the initial branch. This default branch name 2025-12-04T09:16:18.3637148Z hint: is subject to change. To configure the initial branch name to use in all 2025-12-04T09:16:18.3637693Z hint: of your new repositories, which will suppress this warning, call: 2025-12-04T09:16:18.3638094Z hint: 2025-12-04T09:16:18.3638402Z hint: git config --global init.defaultBranch 2025-12-04T09:16:18.3638744Z hint: 2025-12-04T09:16:18.3639077Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2025-12-04T09:16:18.3639740Z hint: 'development'. 
The just-created branch can be renamed via this command: 2025-12-04T09:16:18.3640170Z hint: 2025-12-04T09:16:18.3640399Z hint: git branch -m 2025-12-04T09:16:18.3640660Z hint: 2025-12-04T09:16:18.3641036Z hint: Disable this message with "git config set advice.defaultBranchName false" 2025-12-04T09:16:18.3646172Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/ 2025-12-04T09:16:18.3657405Z [command]/usr/bin/git remote add origin https://github.com/pytorch/pytorch 2025-12-04T09:16:18.3706759Z ##[endgroup] 2025-12-04T09:16:18.3707205Z ##[group]Disabling automatic garbage collection 2025-12-04T09:16:18.3710638Z [command]/usr/bin/git config --local gc.auto 0 2025-12-04T09:16:18.3744966Z ##[endgroup] 2025-12-04T09:16:18.3745365Z ##[group]Setting up auth 2025-12-04T09:16:18.3751397Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T09:16:18.3786328Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T09:16:18.4224105Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T09:16:18.4259478Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T09:16:18.4664456Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T09:16:18.4701074Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T09:16:18.5090852Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T09:16:18.5155585Z ##[endgroup] 2025-12-04T09:16:18.5156285Z ##[group]Fetching 
the repository 2025-12-04T09:16:18.5165014Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T09:17:10.8016861Z From https://github.com/pytorch/pytorch 2025-12-04T09:17:10.8017440Z * [new branch] 2.6.0.dev20241004+ -> origin/2.6.0.dev20241004+ 2025-12-04T09:17:10.8017939Z * [new branch] 2.9.1 -> origin/2.9.1 2025-12-04T09:17:10.8018554Z * [new branch] AaronWang04_addmmfusion_perftest -> origin/AaronWang04_addmmfusion_perftest 2025-12-04T09:17:10.8020031Z * [new branch] Flamefire-patch-1 -> origin/Flamefire-patch-1 2025-12-04T09:17:10.8022553Z * [new branch] HDCharles-2.6.0-release-notes -> origin/HDCharles-2.6.0-release-notes 2025-12-04T09:17:10.8024683Z * [new branch] HOPrintFunc -> origin/HOPrintFunc 2025-12-04T09:17:10.8028367Z * [new branch] IvanKobzarev/stack/1 -> origin/IvanKobzarev/stack/1 2025-12-04T09:17:10.8031458Z * [new branch] NicoshevSVE128 -> origin/NicoshevSVE128 2025-12-04T09:17:10.8033495Z * [new branch] PR-AOTInductorNoneBug -> origin/PR-AOTInductorNoneBug 2025-12-04T09:17:10.8035831Z * [new branch] PR-AOTInductorNoneBugFix -> origin/PR-AOTInductorNoneBugFix 2025-12-04T09:17:10.8037934Z * [new branch] PR-FixConfigsIssue -> origin/PR-FixConfigsIssue 2025-12-04T09:17:10.8040128Z * [new branch] PR-NoneBugFix-viable -> origin/PR-NoneBugFix-viable 2025-12-04T09:17:10.8042481Z * [new branch] PR-ResetToZero -> origin/PR-ResetToZero 2025-12-04T09:17:10.8044921Z * [new branch] Update-Flash-Packaging -> origin/Update-Flash-Packaging 2025-12-04T09:17:10.8047452Z * [new branch] VLA_exp -> origin/VLA_exp 2025-12-04T09:17:10.8049963Z * [new branch] activation_bench -> origin/activation_bench 2025-12-04T09:17:10.8053945Z * [new branch] addmm-heuristic -> origin/addmm-heuristic 2025-12-04T09:17:10.8054929Z * [new branch] adi/onednn_aarch64 -> origin/adi/onednn_aarch64 2025-12-04T09:17:10.8057769Z * [new branch] adi/test -> origin/adi/test 
2025-12-04T09:17:10.8059620Z * [new branch] adi/test_bgemm -> origin/adi/test_bgemm 2025-12-04T09:17:10.8062119Z * [new branch] adi/test_m8g -> origin/adi/test_m8g 2025-12-04T09:17:10.8064310Z * [new branch] adi/test_onednn -> origin/adi/test_onednn 2025-12-04T09:17:10.8066520Z * [new branch] adi/test_onednn_v3.9 -> origin/adi/test_onednn_v3.9 2025-12-04T09:17:10.8068277Z * [new branch] adi/test_presve_change -> origin/adi/test_presve_change 2025-12-04T09:17:10.8070508Z * [new branch] adi/test_timm -> origin/adi/test_timm 2025-12-04T09:17:10.8073043Z * [new branch] adi/testpresve_change -> origin/adi/testpresve_change 2025-12-04T09:17:10.8077024Z * [new branch] aditew01/test/vec_bf16 -> origin/aditew01/test/vec_bf16 2025-12-04T09:17:10.8079259Z * [new branch] ah-globalfeedback-hook -> origin/ah-globalfeedback-hook 2025-12-04T09:17:10.8082070Z * [new branch] albanD-patch-1 -> origin/albanD-patch-1 2025-12-04T09:17:10.8083923Z * [new branch] also-surround-shimh -> origin/also-surround-shimh 2025-12-04T09:17:10.8087264Z * [new branch] angelayi/aot_compile -> origin/angelayi/aot_compile 2025-12-04T09:17:10.8089274Z * [new branch] angelayi/aoti_additional_files -> origin/angelayi/aoti_additional_files 2025-12-04T09:17:10.8091636Z * [new branch] angelayi/benchmark -> origin/angelayi/benchmark 2025-12-04T09:17:10.8093575Z * [new branch] angelayi/change_pytree_serialization -> origin/angelayi/change_pytree_serialization 2025-12-04T09:17:10.8095549Z * [new branch] angelayi/cpp_loader -> origin/angelayi/cpp_loader 2025-12-04T09:17:10.8098201Z * [new branch] angelayi/inductor_const -> origin/angelayi/inductor_const 2025-12-04T09:17:10.8100096Z * [new branch] angelayi/lstm -> origin/angelayi/lstm 2025-12-04T09:17:10.8103505Z * [new branch] angelayi/no_so_weight -> origin/angelayi/no_so_weight 2025-12-04T09:17:10.8106000Z * [new branch] angelayi/scan_layers -> origin/angelayi/scan_layers 2025-12-04T09:17:10.8107881Z * [new branch] angelayi/side_eff -> origin/angelayi/side_eff 
2025-12-04T09:17:10.8110706Z * [new branch] angelayi/state_dict -> origin/angelayi/state_dict 2025-12-04T09:17:10.8112498Z * [new branch] angelayi/symint_input -> origin/angelayi/symint_input 2025-12-04T09:17:10.8115456Z * [new branch] angelayi/symm_mem -> origin/angelayi/symm_mem 2025-12-04T09:17:10.8116935Z * [new branch] angelayi/test_cpp -> origin/angelayi/test_cpp 2025-12-04T09:17:10.8119692Z * [new branch] angelayi/torch_size -> origin/angelayi/torch_size 2025-12-04T09:17:10.8121708Z * [new branch] annotate_assert -> origin/annotate_assert 2025-12-04T09:17:10.8124375Z * [new branch] annotate_fallback_kernel -> origin/annotate_fallback_kernel 2025-12-04T09:17:10.8126335Z * [new branch] annotation_deepcopy -> origin/annotation_deepcopy 2025-12-04T09:17:10.8128841Z * [new branch] annotation_dynamo -> origin/annotation_dynamo 2025-12-04T09:17:10.8131128Z * [new branch] aot_eager_stack_trace -> origin/aot_eager_stack_trace 2025-12-04T09:17:10.8133066Z * [new branch] aoti-cuda-alloc -> origin/aoti-cuda-alloc 2025-12-04T09:17:10.8135645Z * [new branch] aoti_const_device -> origin/aoti_const_device 2025-12-04T09:17:10.8138040Z * [new branch] aoti_fqn_name_interface -> origin/aoti_fqn_name_interface 2025-12-04T09:17:10.8140342Z * [new branch] aoti_package_weights_binary -> origin/aoti_package_weights_binary 2025-12-04T09:17:10.8142753Z * [new branch] aoti_target_windows -> origin/aoti_target_windows 2025-12-04T09:17:10.8146951Z * [new branch] arsh/feat/inductor_check_profiling -> origin/arsh/feat/inductor_check_profiling 2025-12-04T09:17:10.8148780Z * [new branch] async_tp -> origin/async_tp 2025-12-04T09:17:10.8151769Z * [new branch] atalman-inductor-perf-cu124 -> origin/atalman-inductor-perf-cu124 2025-12-04T09:17:10.8154339Z * [new branch] atalman-inductor-perf-cu124.1 -> origin/atalman-inductor-perf-cu124.1 2025-12-04T09:17:10.8156891Z * [new branch] atalman-patch-2 -> origin/atalman-patch-2 2025-12-04T09:17:10.8159599Z * [new branch] atalman-patch-3 -> 
origin/atalman-patch-3 2025-12-04T09:17:10.8162271Z * [new branch] atalman-patch-4 -> origin/atalman-patch-4 2025-12-04T09:17:10.8164904Z * [new branch] atalman-patch-5 -> origin/atalman-patch-5 2025-12-04T09:17:10.8167415Z * [new branch] atalman-patch-6 -> origin/atalman-patch-6 2025-12-04T09:17:10.8170050Z * [new branch] atalman-patch-7 -> origin/atalman-patch-7 2025-12-04T09:17:10.8172670Z * [new branch] atalman-patch-8 -> origin/atalman-patch-8 2025-12-04T09:17:10.8175237Z * [new branch] atalman_inductor_2.3.1 -> origin/atalman_inductor_2.3.1 2025-12-04T09:17:10.8177746Z * [new branch] atalman_inductor_2.4.0 -> origin/atalman_inductor_2.4.0 2025-12-04T09:17:10.8180315Z * [new branch] atalman_inductor_2.4.x -> origin/atalman_inductor_2.4.x 2025-12-04T09:17:10.8183013Z * [new branch] attention_benchmarking_clean -> origin/attention_benchmarking_clean 2025-12-04T09:17:10.8186265Z * [new branch] bahuang/dt_fix_scalar_add -> origin/bahuang/dt_fix_scalar_add 2025-12-04T09:17:10.8188263Z * [new branch] bahuang/fix_debug_mode -> origin/bahuang/fix_debug_mode 2025-12-04T09:17:10.8209730Z * [new branch] bahuang/fix_expand -> origin/bahuang/fix_expand 2025-12-04T09:17:10.8210548Z * [new branch] bahuang/test -> origin/bahuang/test 2025-12-04T09:17:10.8211218Z * [new branch] base/1.5 -> origin/base/1.5 2025-12-04T09:17:10.8212037Z * [new branch] batching_sdpa_efficient_attention -> origin/batching_sdpa_efficient_attention 2025-12-04T09:17:10.8212917Z * [new branch] bench_scaled_mm_ops -> origin/bench_scaled_mm_ops 2025-12-04T09:17:10.8213999Z * [new branch] benchmark-updates -> origin/benchmark-updates 2025-12-04T09:17:10.8214827Z * [new branch] benchmarking-script -> origin/benchmarking-script 2025-12-04T09:17:10.8215617Z * [new branch] bertmaher/pinbump26 -> origin/bertmaher/pinbump26 2025-12-04T09:17:10.8216588Z * [new branch] bertrand/cutlass -> origin/bertrand/cutlass 2025-12-04T09:17:10.8218040Z * [new branch] bf/bug-static-input -> origin/bf/bug-static-input 
2025-12-04T09:17:10.8220448Z * [new branch] bf/cg-backend -> origin/bf/cg-backend 2025-12-04T09:17:10.8222739Z * [new branch] bf/cg-nccl-test -> origin/bf/cg-nccl-test 2025-12-04T09:17:10.8225085Z * [new branch] bf/cg-remove-check -> origin/bf/cg-remove-check 2025-12-04T09:17:10.8227591Z * [new branch] bf/clean-torchbench-hf -> origin/bf/clean-torchbench-hf 2025-12-04T09:17:10.8230009Z * [new branch] bf/combo-debug-log -> origin/bf/combo-debug-log 2025-12-04T09:17:10.8232338Z * [new branch] bf/cudagraph -> origin/bf/cudagraph 2025-12-04T09:17:10.8235439Z * [new branch] bf/cudagraph-disable-input-mutation -> origin/bf/cudagraph-disable-input-mutation 2025-12-04T09:17:10.8238129Z * [new branch] bf/cudagraph-enable-input-mutation-support-benchmark -> origin/bf/cudagraph-enable-input-mutation-support-benchmark 2025-12-04T09:17:10.8240138Z * [new branch] bf/cudagraph-partition -> origin/bf/cudagraph-partition 2025-12-04T09:17:10.8242873Z * [new branch] bf/donated-buffer-bench -> origin/bf/donated-buffer-bench 2025-12-04T09:17:10.8245365Z * [new branch] bf/dynamo-partition -> origin/bf/dynamo-partition 2025-12-04T09:17:10.8247776Z * [new branch] bf/lite -> origin/bf/lite 2025-12-04T09:17:10.8250258Z * [new branch] bf/pa-non-divisible -> origin/bf/pa-non-divisible 2025-12-04T09:17:10.8252885Z * [new branch] bf/partition-cache-free-symbols -> origin/bf/partition-cache-free-symbols 2025-12-04T09:17:10.8255366Z * [new branch] bf/partition-memory-plan -> origin/bf/partition-memory-plan 2025-12-04T09:17:10.8257927Z * [new branch] bf/partition-move-cpu -> origin/bf/partition-move-cpu 2025-12-04T09:17:10.8260564Z * [new branch] bf/partition-view-fallback -> origin/bf/partition-view-fallback 2025-12-04T09:17:10.8262923Z * [new branch] bf/remove-check-55b0c39d -> origin/bf/remove-check-55b0c39d 2025-12-04T09:17:10.8265476Z * [new branch] bf/timm-nov-26-2025 -> origin/bf/timm-nov-26-2025 2025-12-04T09:17:10.8267836Z * [new branch] bf/transformer-pin-4-57-3 -> 
origin/bf/transformer-pin-4-57-3 2025-12-04T09:17:10.8270369Z * [new branch] bisect_perf_hf_T5_3acc6eac492 -> origin/bisect_perf_hf_T5_3acc6eac492 2025-12-04T09:17:10.8272790Z * [new branch] bisect_perf_hf_T5_3fcf66f61fb -> origin/bisect_perf_hf_T5_3fcf66f61fb 2025-12-04T09:17:10.8275220Z * [new branch] bisect_perf_hf_T5_4009d154129 -> origin/bisect_perf_hf_T5_4009d154129 2025-12-04T09:17:10.8277640Z * [new branch] bisect_perf_hf_T5_40d0740e73d -> origin/bisect_perf_hf_T5_40d0740e73d 2025-12-04T09:17:10.8280203Z * [new branch] bisect_perf_hf_T5_5268754e -> origin/bisect_perf_hf_T5_5268754e 2025-12-04T09:17:10.8282718Z * [new branch] bisect_perf_hf_T5_7d89a8d385c -> origin/bisect_perf_hf_T5_7d89a8d385c 2025-12-04T09:17:10.8285069Z * [new branch] bisect_perf_hf_T5_b7a25c1ee7c -> origin/bisect_perf_hf_T5_b7a25c1ee7c 2025-12-04T09:17:10.8287477Z * [new branch] bisect_perf_hf_T5_c25b201583f -> origin/bisect_perf_hf_T5_c25b201583f 2025-12-04T09:17:10.8290026Z * [new branch] bisect_perf_hf_T5_c93e57efac0 -> origin/bisect_perf_hf_T5_c93e57efac0 2025-12-04T09:17:10.8292788Z * [new branch] bisect_perf_hf_T5_ca9813ea149 -> origin/bisect_perf_hf_T5_ca9813ea149 2025-12-04T09:17:10.8295160Z * [new branch] bisect_perf_hf_T5_d65f194a -> origin/bisect_perf_hf_T5_d65f194a 2025-12-04T09:17:10.8297470Z * [new branch] bisect_perf_hf_T5_da94ab0b -> origin/bisect_perf_hf_T5_da94ab0b 2025-12-04T09:17:10.8300047Z * [new branch] bisect_perf_hf_T5_da94ab0b_new -> origin/bisect_perf_hf_T5_da94ab0b_new 2025-12-04T09:17:10.8302941Z * [new branch] bisect_perf_hf_T5_db4e8a1d8a8 -> origin/bisect_perf_hf_T5_db4e8a1d8a8 2025-12-04T09:17:10.8305350Z * [new branch] bisect_perf_hf_T5_e0d97e936a2 -> origin/bisect_perf_hf_T5_e0d97e936a2 2025-12-04T09:17:10.8307722Z * [new branch] bisect_perf_hf_T5_f23621ec563 -> origin/bisect_perf_hf_T5_f23621ec563 2025-12-04T09:17:10.8311108Z * [new branch] brister/fx_device_type -> origin/brister/fx_device_type 2025-12-04T09:17:10.8313394Z * [new branch] 
brister/test_inductor_all_fx -> origin/brister/test_inductor_all_fx 2025-12-04T09:17:10.8315802Z * [new branch] brister/tiled_reduction_no_numel_check -> origin/brister/tiled_reduction_no_numel_check 2025-12-04T09:17:10.8318223Z * [new branch] bwd-backup -> origin/bwd-backup 2025-12-04T09:17:10.8321011Z * [new branch] c57382a49 -> origin/c57382a49 2025-12-04T09:17:10.8323344Z * [new branch] ca_0431d47eaa -> origin/ca_0431d47eaa 2025-12-04T09:17:10.8325734Z * [new branch] ca_fix_0431d47eaa -> origin/ca_fix_0431d47eaa 2025-12-04T09:17:10.8329257Z * [new branch] camyllh/test_setup_hooks_push -> origin/camyllh/test_setup_hooks_push 2025-12-04T09:17:10.8331684Z * [new branch] cccclai-patch-1 -> origin/cccclai-patch-1 2025-12-04T09:17:10.8334356Z * [new branch] cherry-pick-159969-by-pytorch_bot_bot_ -> origin/cherry-pick-159969-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8336877Z * [new branch] cherry-pick-160586-by-pytorch_bot_bot_ -> origin/cherry-pick-160586-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8339569Z * [new branch] cherry-pick-162208-by-pytorch_bot_bot_ -> origin/cherry-pick-162208-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8342118Z * [new branch] cherry-pick-163169-by-pytorch_bot_bot_ -> origin/cherry-pick-163169-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8344638Z * [new branch] cherry-pick-165086-by-pytorch_bot_bot_ -> origin/cherry-pick-165086-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8347447Z * [new branch] cherry-pick-165514-by-pytorch_bot_bot_ -> origin/cherry-pick-165514-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8349843Z * [new branch] cherry-pick-165601-by-pytorch_bot_bot_ -> origin/cherry-pick-165601-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8352464Z * [new branch] cherry-pick-165667-by-pytorch_bot_bot_ -> origin/cherry-pick-165667-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8355046Z * [new branch] cherry-pick-165815-by-pytorch_bot_bot_ -> origin/cherry-pick-165815-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8357628Z * [new branch] cherry-pick-165922-by-pytorch_bot_bot_ -> 
origin/cherry-pick-165922-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8360471Z * [new branch] cherry-pick-166148-by-pytorch_bot_bot_ -> origin/cherry-pick-166148-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8362825Z * [new branch] cherry-pick-166181-by-pytorch_bot_bot_ -> origin/cherry-pick-166181-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8365401Z * [new branch] cherry-pick-166404-by-pytorch_bot_bot_ -> origin/cherry-pick-166404-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8367960Z * [new branch] cherry-pick-166427-by-pytorch_bot_bot_ -> origin/cherry-pick-166427-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8370768Z * [new branch] cherry-pick-166480-by-pytorch_bot_bot_ -> origin/cherry-pick-166480-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8372976Z * [new branch] cherry-pick-166570-by-pytorch_bot_bot_ -> origin/cherry-pick-166570-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8375667Z * [new branch] cherry-pick-166993-by-pytorch_bot_bot_ -> origin/cherry-pick-166993-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8378077Z * [new branch] cherry-pick-167111-by-pytorch_bot_bot_ -> origin/cherry-pick-167111-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8380881Z * [new branch] cherry-pick-167478-by-pytorch_bot_bot_ -> origin/cherry-pick-167478-by-pytorch_bot_bot_ 2025-12-04T09:17:10.8382929Z * [new branch] cherry_pick_166036_166040 -> origin/cherry_pick_166036_166040 2025-12-04T09:17:10.8385660Z * [new branch] cherry_pick_166457 -> origin/cherry_pick_166457 2025-12-04T09:17:10.8388114Z * [new branch] cherrypick_166338 -> origin/cherrypick_166338 2025-12-04T09:17:10.8390666Z * [new branch] cherrypick_166458 -> origin/cherrypick_166458 2025-12-04T09:17:10.8393078Z * [new branch] cherrypick_166586 -> origin/cherrypick_166586 2025-12-04T09:17:10.8395636Z * [new branch] cherrypick_166956 -> origin/cherrypick_166956 2025-12-04T09:17:10.8398091Z * [new branch] ci_attn -> origin/ci_attn 2025-12-04T09:17:10.8401022Z * [new branch] codex-testing -> origin/codex-testing 2025-12-04T09:17:10.8404748Z * [new branch] 
codex/add-check_memory_overlap-helper-functions -> origin/codex/add-check_memory_overlap-helper-functions 2025-12-04T09:17:10.8406380Z * [new branch] codex/fix-issue-121219-in-pytorch -> origin/codex/fix-issue-121219-in-pytorch 2025-12-04T09:17:10.8409772Z * [new branch] codex/investigate-segfaults-in-get_tensor_storage_id -> origin/codex/investigate-segfaults-in-get_tensor_storage_id 2025-12-04T09:17:10.8412221Z * [new branch] codex/refactor-lintrunner-config-to-use-uv-run -> origin/codex/refactor-lintrunner-config-to-use-uv-run 2025-12-04T09:17:10.8414174Z * [new branch] compatiblpy39util -> origin/compatiblpy39util 2025-12-04T09:17:10.8416972Z * [new branch] cond_hop_device -> origin/cond_hop_device 2025-12-04T09:17:10.8419368Z * [new branch] context_test -> origin/context_test 2025-12-04T09:17:10.8423336Z * [new branch] copilot/code-style-cleanup-python-pip -> origin/copilot/code-style-cleanup-python-pip 2025-12-04T09:17:10.8426304Z * [new branch] cpio/fix_new_ami_tests -> origin/cpio/fix_new_ami_tests 2025-12-04T09:17:10.8428901Z * [new branch] cpp-docs-dependency-upgrade -> origin/cpp-docs-dependency-upgrade 2025-12-04T09:17:10.8432221Z * [new branch] crpa/typo-in-inductor_comm_lowering -> origin/crpa/typo-in-inductor_comm_lowering 2025-12-04T09:17:10.8435273Z * [new branch] csl/always_produce_xml -> origin/csl/always_produce_xml 2025-12-04T09:17:10.8437496Z * [new branch] csl/build_test_more_procs -> origin/csl/build_test_more_procs 2025-12-04T09:17:10.8440730Z * [new branch] csl/build_test_more_procs2 -> origin/csl/build_test_more_procs2 2025-12-04T09:17:10.8442915Z * [new branch] csl/clean_up -> origin/csl/clean_up 2025-12-04T09:17:10.8445359Z * [new branch] csl/fix_retry_segfault_exit -> origin/csl/fix_retry_segfault_exit 2025-12-04T09:17:10.8447596Z * [new branch] csl/katex -> origin/csl/katex 2025-12-04T09:17:10.8450378Z * [new branch] csl/larger_runner -> origin/csl/larger_runner 2025-12-04T09:17:10.8453252Z * [new branch] csl/lint_testing -> 
origin/csl/lint_testing 2025-12-04T09:17:10.8456098Z * [new branch] csl/lint_thing -> origin/csl/lint_thing 2025-12-04T09:17:10.8458767Z * [new branch] csl/lintrunner_stuff -> origin/csl/lintrunner_stuff 2025-12-04T09:17:10.8461170Z * [new branch] csl/manually_gen_json -> origin/csl/manually_gen_json 2025-12-04T09:17:10.8463552Z * [new branch] csl/mps_sharding -> origin/csl/mps_sharding 2025-12-04T09:17:10.8466097Z * [new branch] csl/multistage_docker -> origin/csl/multistage_docker 2025-12-04T09:17:10.8468659Z * [new branch] csl/print_timing -> origin/csl/print_timing 2025-12-04T09:17:10.8471040Z * [new branch] csl/remove_experiment -> origin/csl/remove_experiment 2025-12-04T09:17:10.8473555Z * [new branch] csl/remove_maybe_unused_var -> origin/csl/remove_maybe_unused_var 2025-12-04T09:17:10.8476205Z * [new branch] csl/remove_repo_specific_autolabel -> origin/csl/remove_repo_specific_autolabel 2025-12-04T09:17:10.8478689Z * [new branch] csl/remove_run_parallel -> origin/csl/remove_run_parallel 2025-12-04T09:17:10.8481209Z * [new branch] csl/remove_unused_vars -> origin/csl/remove_unused_vars 2025-12-04T09:17:10.8483702Z * [new branch] csl/revert_open -> origin/csl/revert_open 2025-12-04T09:17:10.8486211Z * [new branch] csl/skip_build -> origin/csl/skip_build 2025-12-04T09:17:10.8488696Z * [new branch] csl/smaller_avx_amx_runenrs -> origin/csl/smaller_avx_amx_runenrs 2025-12-04T09:17:10.8491003Z * [new branch] csl/td_job_level -> origin/csl/td_job_level 2025-12-04T09:17:10.8493549Z * [new branch] csl/test_cuda_build_large_runner -> origin/csl/test_cuda_build_large_runner 2025-12-04T09:17:10.8496193Z * [new branch] csl/test_owners_autograd_dispatch_nn -> origin/csl/test_owners_autograd_dispatch_nn 2025-12-04T09:17:10.8498633Z * [new branch] csl/test_owners_higher_confidence -> origin/csl/test_owners_higher_confidence 2025-12-04T09:17:10.8501621Z * [new branch] csl/upload_json_running -> origin/csl/upload_json_running 2025-12-04T09:17:10.8503726Z * [new branch] 
csl/win_sccache -> origin/csl/win_sccache 2025-12-04T09:17:10.8505499Z * [new branch] csl/xml_stuff -> origin/csl/xml_stuff 2025-12-04T09:17:10.8507571Z * [new branch] cublasrelax2 -> origin/cublasrelax2 2025-12-04T09:17:10.8509505Z * [new branch] cuda_mempool -> origin/cuda_mempool 2025-12-04T09:17:10.8511477Z * [new branch] custom_lowering_dict -> origin/custom_lowering_dict 2025-12-04T09:17:10.8514008Z * [new branch] d4l3k/debug_plane_frtrace -> origin/d4l3k/debug_plane_frtrace 2025-12-04T09:17:10.8516749Z * [new branch] daxia6/2.8o3 -> origin/daxia6/2.8o3 2025-12-04T09:17:10.8518654Z * [new branch] debug-guard -> origin/debug-guard 2025-12-04T09:17:10.8520808Z * [new branch] delete-quant-docs -> origin/delete-quant-docs 2025-12-04T09:17:10.8526767Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 2025-12-04T09:17:10.8528708Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 2025-12-04T09:17:10.8531613Z * [new branch] desertfire/test_cpp_wrapper -> origin/desertfire/test_cpp_wrapper 2025-12-04T09:17:10.8533282Z * [new branch] desertfire/triton-cpu-for-aarch64 -> origin/desertfire/triton-cpu-for-aarch64 2025-12-04T09:17:10.8536797Z * [new branch] dev/dhruva/flex_attn_opt -> origin/dev/dhruva/flex_attn_opt 2025-12-04T09:17:10.8539643Z * [new branch] dev/joona/MPSNDArrayAdd -> origin/dev/joona/MPSNDArrayAdd 2025-12-04T09:17:10.8541865Z * [new branch] dev/joona/Unranked -> origin/dev/joona/Unranked 2025-12-04T09:17:10.8543902Z * [new branch] dev/joona/cat -> origin/dev/joona/cat 2025-12-04T09:17:10.8545780Z * [new branch] dev/joona/embeddingbag -> origin/dev/joona/embeddingbag 2025-12-04T09:17:10.8547766Z * [new branch] dev/joona/fix_sdpa_memtest -> origin/dev/joona/fix_sdpa_memtest 2025-12-04T09:17:10.8549975Z * [new branch] 
dev/joona/getTensorsString -> origin/dev/joona/getTensorsString 2025-12-04T09:17:10.8552103Z * [new branch] dev/joona/mps_linear_macos14 -> origin/dev/joona/mps_linear_macos14 2025-12-04T09:17:10.8554541Z * [new branch] dev/joona/scalar_clamp -> origin/dev/joona/scalar_clamp 2025-12-04T09:17:10.8557063Z * [new branch] dev/joona/sdpa -> origin/dev/joona/sdpa 2025-12-04T09:17:10.8559916Z * [new branch] dev/joona/sdpa_api -> origin/dev/joona/sdpa_api 2025-12-04T09:17:10.8562059Z * [new branch] dev/joona/type_inf -> origin/dev/joona/type_inf 2025-12-04T09:17:10.8564275Z * [new branch] dev/joona/ulpAssertClose -> origin/dev/joona/ulpAssertClose 2025-12-04T09:17:10.8566397Z * [new branch] dev/joona/upsize3d -> origin/dev/joona/upsize3d 2025-12-04T09:17:10.8568256Z * [new branch] disp_counter -> origin/disp_counter 2025-12-04T09:17:10.8570240Z * [new branch] divyanshk-patch-1 -> origin/divyanshk-patch-1 2025-12-04T09:17:10.8572075Z * [new branch] docs -> origin/docs 2025-12-04T09:17:10.8574108Z * [new branch] documentation -> origin/documentation 2025-12-04T09:17:10.8576004Z * [new branch] eager_model_benchmarks -> origin/eager_model_benchmarks 2025-12-04T09:17:10.8578818Z * [new branch] embg/test_inductor_ci_control -> origin/embg/test_inductor_ci_control 2025-12-04T09:17:10.8580616Z * [new branch] embg/triton_l2_prefetch_128B -> origin/embg/triton_l2_prefetch_128B 2025-12-04T09:17:10.8582364Z * [new branch] embg/triton_l2_prefetch_256B -> origin/embg/triton_l2_prefetch_256B 2025-12-04T09:17:10.8584287Z * [new branch] eqy-patch-1 -> origin/eqy-patch-1 2025-12-04T09:17:10.8586278Z * [new branch] eqy-patch-2 -> origin/eqy-patch-2 2025-12-04T09:17:10.8588306Z * [new branch] eqy-patch-3 -> origin/eqy-patch-3 2025-12-04T09:17:10.8590373Z * [new branch] eqy-patch-4 -> origin/eqy-patch-4 2025-12-04T09:17:10.8592322Z * [new branch] eqy-patch-5 -> origin/eqy-patch-5 2025-12-04T09:17:10.8594251Z * [new branch] eqy-patch-6 -> origin/eqy-patch-6 2025-12-04T09:17:10.8596910Z * [new 
branch] exclamaforte/amd-ma -> origin/exclamaforte/amd-ma 2025-12-04T09:17:10.8598985Z * [new branch] exclamaforte/combo-kernels-perf-run -> origin/exclamaforte/combo-kernels-perf-run 2025-12-04T09:17:10.8601255Z * [new branch] exclamaforte/do_bench_refactor -> origin/exclamaforte/do_bench_refactor 2025-12-04T09:17:10.8603218Z * [new branch] exclamaforte/enable-mem-dep-fusion -> origin/exclamaforte/enable-mem-dep-fusion 2025-12-04T09:17:10.8605180Z * [new branch] exclamaforte/fix-exhaustive-autotuning -> origin/exclamaforte/fix-exhaustive-autotuning 2025-12-04T09:17:10.8607389Z * [new branch] exclamaforte/fix-trace-parsing-fx-svg -> origin/exclamaforte/fix-trace-parsing-fx-svg 2025-12-04T09:17:10.8609717Z * [new branch] exclamaforte/force-pointwise-cat-perf-run -> origin/exclamaforte/force-pointwise-cat-perf-run 2025-12-04T09:17:10.8611500Z * [new branch] exclamaforte/fusion-data -> origin/exclamaforte/fusion-data 2025-12-04T09:17:10.8613741Z * [new branch] exclamaforte/gemm-benchmark-run -> origin/exclamaforte/gemm-benchmark-run 2025-12-04T09:17:10.8615411Z * [new branch] exclamaforte/gemm-export-model -> origin/exclamaforte/gemm-export-model 2025-12-04T09:17:10.8617347Z * [new branch] exclamaforte/gemm-model -> origin/exclamaforte/gemm-model 2025-12-04T09:17:10.8619583Z * [new branch] exclamaforte/gemm-model-all-data-collection -> origin/exclamaforte/gemm-model-all-data-collection 2025-12-04T09:17:10.8621336Z * [new branch] exclamaforte/gemm-to-amd -> origin/exclamaforte/gemm-to-amd 2025-12-04T09:17:10.8623274Z * [new branch] exclamaforte/just-gemm-model -> origin/exclamaforte/just-gemm-model 2025-12-04T09:17:10.8625404Z * [new branch] exclamaforte/just-gemm-model-no-refactor -> origin/exclamaforte/just-gemm-model-no-refactor 2025-12-04T09:17:10.8627372Z * [new branch] exclamaforte/profile-diff-algo -> origin/exclamaforte/profile-diff-algo 2025-12-04T09:17:10.8629344Z * [new branch] exclamaforte/profiler-visualization -> origin/exclamaforte/profiler-visualization 
2025-12-04T09:17:10.8631317Z * [new branch] exclamaforte/test_cpp_wrapper_mode -> origin/exclamaforte/test_cpp_wrapper_mode 2025-12-04T09:17:10.8633327Z * [new branch] exclamaforte/update-autotune-configs -> origin/exclamaforte/update-autotune-configs 2025-12-04T09:17:10.8635380Z * [new branch] exclamaforte/update-autotune-configs-2 -> origin/exclamaforte/update-autotune-configs-2 2025-12-04T09:17:10.8637174Z * [new branch] exec -> origin/exec 2025-12-04T09:17:10.8639318Z * [new branch] experimental-mosaic -> origin/experimental-mosaic 2025-12-04T09:17:10.8641512Z * [new branch] export-D61047529 -> origin/export-D61047529 2025-12-04T09:17:10.8643429Z * [new branch] export-D71412006 -> origin/export-D71412006 2025-12-04T09:17:10.8645494Z * [new branch] export-D73042989 -> origin/export-D73042989 2025-12-04T09:17:10.8647413Z * [new branch] export-D78957093 -> origin/export-D78957093 2025-12-04T09:17:10.8649361Z * [new branch] export-D78996107 -> origin/export-D78996107 2025-12-04T09:17:10.8651271Z * [new branch] export-D80823877 -> origin/export-D80823877 2025-12-04T09:17:10.8653410Z * [new branch] export-D80958642 -> origin/export-D80958642 2025-12-04T09:17:10.8655249Z * [new branch] export-D81054193 -> origin/export-D81054193 2025-12-04T09:17:10.8657185Z * [new branch] export-D81204584 -> origin/export-D81204584 2025-12-04T09:17:10.8659088Z * [new branch] export-D81429090 -> origin/export-D81429090 2025-12-04T09:17:10.8661124Z * [new branch] export-D82250826 -> origin/export-D82250826 2025-12-04T09:17:10.8663054Z * [new branch] export-D82253817 -> origin/export-D82253817 2025-12-04T09:17:10.8665523Z * [new branch] export-D83541846 -> origin/export-D83541846 2025-12-04T09:17:10.8667474Z * [new branch] export-D83627170 -> origin/export-D83627170 2025-12-04T09:17:10.8669426Z * [new branch] export-D83766701 -> origin/export-D83766701 2025-12-04T09:17:10.8671324Z * [new branch] export-D83768878 -> origin/export-D83768878 2025-12-04T09:17:10.8673283Z * [new branch] 
export-D83769447 -> origin/export-D83769447 2025-12-04T09:17:10.8675221Z * [new branch] export-D84089824 -> origin/export-D84089824 2025-12-04T09:17:10.8677151Z * [new branch] export-D84213020 -> origin/export-D84213020 2025-12-04T09:17:10.8679666Z * [new branch] export-D84373821 -> origin/export-D84373821 2025-12-04T09:17:10.8681807Z * [new branch] export-D84612194 -> origin/export-D84612194 2025-12-04T09:17:10.8683654Z * [new branch] export-D84890985 -> origin/export-D84890985 2025-12-04T09:17:10.8685539Z * [new branch] export-D85122326 -> origin/export-D85122326 2025-12-04T09:17:10.8687562Z * [new branch] export-D86256198 -> origin/export-D86256198 2025-12-04T09:17:10.8689485Z * [new branch] export-D86460608 -> origin/export-D86460608 2025-12-04T09:17:10.8691535Z * [new branch] export-D86474796 -> origin/export-D86474796 2025-12-04T09:17:10.8693650Z * [new branch] export-D86712396 -> origin/export-D86712396 2025-12-04T09:17:10.8695564Z * [new branch] export-D87022129 -> origin/export-D87022129 2025-12-04T09:17:10.8697570Z * [new branch] export-D87838959 -> origin/export-D87838959 2025-12-04T09:17:10.8699633Z * [new branch] export-D88319437 -> origin/export-D88319437 2025-12-04T09:17:10.8703517Z * [new branch] exported-model-train-idempotent -> origin/exported-model-train-idempotent 2025-12-04T09:17:10.8705304Z * [new branch] ezyang-titan-october -> origin/ezyang-titan-october 2025-12-04T09:17:10.8707214Z * [new branch] ezyang-titan-october2 -> origin/ezyang-titan-october2 2025-12-04T09:17:10.8709039Z * [new branch] ezyang-war -> origin/ezyang-war 2025-12-04T09:17:10.8711720Z * [new branch] ezyang/wip-aot-descriptors -> origin/ezyang/wip-aot-descriptors 2025-12-04T09:17:10.8713539Z * [new branch] fa_u8_brgemm -> origin/fa_u8_brgemm 2025-12-04T09:17:10.8716222Z * [new branch] fadeputr/sequence_fbgemm -> origin/fadeputr/sequence_fbgemm 2025-12-04T09:17:10.8718226Z * [new branch] fastmath_baseline -> origin/fastmath_baseline 2025-12-04T09:17:10.8721206Z * [new 
branch] fbcode/warm -> origin/fbcode/warm 2025-12-04T09:17:10.8723219Z * [new branch] fca -> origin/fca 2025-12-04T09:17:10.8725076Z * [new branch] fca2_ca5984c -> origin/fca2_ca5984c 2025-12-04T09:17:10.8726993Z * [new branch] fca5 -> origin/fca5 2025-12-04T09:17:10.8729604Z * [new branch] feature/justknobs-cpp -> origin/feature/justknobs-cpp 2025-12-04T09:17:10.8731530Z * [new branch] feature/numa-forkserver -> origin/feature/numa-forkserver 2025-12-04T09:17:10.8733869Z * [new branch] ffast_math_baseline -> origin/ffast_math_baseline 2025-12-04T09:17:10.8736377Z * [new branch] ffast_math_target -> origin/ffast_math_target 2025-12-04T09:17:10.8739420Z * [new branch] findhao/base_commit -> origin/findhao/base_commit 2025-12-04T09:17:10.8741459Z * [new branch] findhao/base_commit1 -> origin/findhao/base_commit1 2025-12-04T09:17:10.8743597Z * [new branch] findhao/multistream2 -> origin/findhao/multistream2 2025-12-04T09:17:10.8746099Z * [new branch] findhao/multistream5 -> origin/findhao/multistream5 2025-12-04T09:17:10.8749004Z * [new branch] findhao/multistream6 -> origin/findhao/multistream6 2025-12-04T09:17:10.8750918Z * [new branch] findhao/operatorbench3 -> origin/findhao/operatorbench3 2025-12-04T09:17:10.8752492Z * [new branch] findhao/operatorbench5 -> origin/findhao/operatorbench5 2025-12-04T09:17:10.8754707Z * [new branch] findhao/tritonparse -> origin/findhao/tritonparse 2025-12-04T09:17:10.8757578Z * [new branch] fix-ck-gemm-template-format -> origin/fix-ck-gemm-template-format 2025-12-04T09:17:10.8759865Z * [new branch] fix-config-ignore -> origin/fix-config-ignore 2025-12-04T09:17:10.8761762Z * [new branch] fix-dict-guard -> origin/fix-dict-guard 2025-12-04T09:17:10.8763722Z * [new branch] fix_addmm_issue -> origin/fix_addmm_issue 2025-12-04T09:17:10.8765663Z * [new branch] fix_amd_missing_cluster_dims -> origin/fix_amd_missing_cluster_dims 2025-12-04T09:17:10.8767432Z * [new branch] fix_bench_bwd_pass -> origin/fix_bench_bwd_pass 
2025-12-04T09:17:10.8769505Z * [new branch] fix_mem_profiler_config -> origin/fix_mem_profiler_config 2025-12-04T09:17:10.8771363Z * [new branch] fix_nvrtc_discovery -> origin/fix_nvrtc_discovery 2025-12-04T09:17:10.8773262Z * [new branch] fix_op_runner -> origin/fix_op_runner 2025-12-04T09:17:10.8775246Z * [new branch] fix_ubn_159469 -> origin/fix_ubn_159469 2025-12-04T09:17:10.8777226Z * [new branch] fixes-triage -> origin/fixes-triage 2025-12-04T09:17:10.8779516Z * [new branch] fixflashinfer -> origin/fixflashinfer 2025-12-04T09:17:10.8781424Z * [new branch] flash_decoding_cpu -> origin/flash_decoding_cpu 2025-12-04T09:17:10.8783293Z * [new branch] flex-flash -> origin/flex-flash 2025-12-04T09:17:10.8785289Z * [new branch] flex_attention_functorch_grad -> origin/flex_attention_functorch_grad 2025-12-04T09:17:10.8787292Z * [new branch] flex_flash -> origin/flex_flash 2025-12-04T09:17:10.8790091Z * [new branch] fmassa/fix_memeff_sharding_rule -> origin/fmassa/fix_memeff_sharding_rule 2025-12-04T09:17:10.8791902Z * [new branch] fmassa/tests_comm_compute_scheduler -> origin/fmassa/tests_comm_compute_scheduler 2025-12-04T09:17:10.8793817Z * [new branch] forkserver_fix -> origin/forkserver_fix 2025-12-04T09:17:10.8795772Z * [new branch] fsdp2_trace_rules -> origin/fsdp2_trace_rules 2025-12-04T09:17:10.8797789Z * [new branch] fx_cpp -> origin/fx_cpp 2025-12-04T09:17:10.8800848Z * [new branch] fy/fix-win -> origin/fy/fix-win 2025-12-04T09:17:10.8805357Z * [new branch] galv-patch-1 -> origin/galv-patch-1 2025-12-04T09:17:10.8807680Z * [new branch] galv/cudagraphs-conditional-nodes-4 -> origin/galv/cudagraphs-conditional-nodes-4 2025-12-04T09:17:10.8810127Z * [new branch] georgehong/cmakelists-patch -> origin/georgehong/cmakelists-patch 2025-12-04T09:17:10.8814139Z * [new branch] gh/AlnisM/1/base -> origin/gh/AlnisM/1/base 2025-12-04T09:17:10.8816065Z * [new branch] gh/AlnisM/1/head -> origin/gh/AlnisM/1/head 2025-12-04T09:17:10.8819400Z * [new branch] 
gh/EikanWang/67/base -> origin/gh/EikanWang/67/base 2025-12-04T09:17:10.8821231Z * [new branch] gh/EikanWang/67/head -> origin/gh/EikanWang/67/head 2025-12-04T09:17:10.8824778Z * [new branch] gh/Gasoonjia/1/base -> origin/gh/Gasoonjia/1/base 2025-12-04T09:17:10.8826701Z * [new branch] gh/Gasoonjia/1/head -> origin/gh/Gasoonjia/1/head 2025-12-04T09:17:10.8829965Z * [new branch] gh/H-Huang/131/base -> origin/gh/H-Huang/131/base 2025-12-04T09:17:10.8831755Z * [new branch] gh/H-Huang/131/head -> origin/gh/H-Huang/131/head 2025-12-04T09:17:10.8833637Z * [new branch] gh/H-Huang/131/orig -> origin/gh/H-Huang/131/orig 2025-12-04T09:17:10.8836267Z * [new branch] gh/H-Huang/132/base -> origin/gh/H-Huang/132/base 2025-12-04T09:17:10.8838100Z * [new branch] gh/H-Huang/132/head -> origin/gh/H-Huang/132/head 2025-12-04T09:17:10.8840145Z * [new branch] gh/H-Huang/132/orig -> origin/gh/H-Huang/132/orig 2025-12-04T09:17:10.8843065Z * [new branch] gh/H-Huang/180/base -> origin/gh/H-Huang/180/base 2025-12-04T09:17:10.8844657Z * [new branch] gh/H-Huang/180/head -> origin/gh/H-Huang/180/head 2025-12-04T09:17:10.8846505Z * [new branch] gh/H-Huang/180/orig -> origin/gh/H-Huang/180/orig 2025-12-04T09:17:10.8848900Z * [new branch] gh/H-Huang/182/base -> origin/gh/H-Huang/182/base 2025-12-04T09:17:10.8850811Z * [new branch] gh/H-Huang/182/head -> origin/gh/H-Huang/182/head 2025-12-04T09:17:10.8852629Z * [new branch] gh/H-Huang/182/orig -> origin/gh/H-Huang/182/orig 2025-12-04T09:17:10.8855913Z * [new branch] gh/H-Huang/226/base -> origin/gh/H-Huang/226/base 2025-12-04T09:17:10.8857576Z * [new branch] gh/H-Huang/226/head -> origin/gh/H-Huang/226/head 2025-12-04T09:17:10.8859448Z * [new branch] gh/H-Huang/226/orig -> origin/gh/H-Huang/226/orig 2025-12-04T09:17:10.8862025Z * [new branch] gh/H-Huang/228/base -> origin/gh/H-Huang/228/base 2025-12-04T09:17:10.8863828Z * [new branch] gh/H-Huang/228/head -> origin/gh/H-Huang/228/head 2025-12-04T09:17:10.8865663Z * [new branch] gh/H-Huang/228/orig 
-> origin/gh/H-Huang/228/orig 2025-12-04T09:17:10.8868891Z * [new branch] gh/IvanKobzarev/150/base -> origin/gh/IvanKobzarev/150/base 2025-12-04T09:17:10.8870826Z * [new branch] gh/IvanKobzarev/150/head -> origin/gh/IvanKobzarev/150/head 2025-12-04T09:17:10.8872640Z * [new branch] gh/IvanKobzarev/150/orig -> origin/gh/IvanKobzarev/150/orig 2025-12-04T09:17:10.8875354Z * [new branch] gh/IvanKobzarev/157/base -> origin/gh/IvanKobzarev/157/base 2025-12-04T09:17:10.8877361Z * [new branch] gh/IvanKobzarev/157/head -> origin/gh/IvanKobzarev/157/head 2025-12-04T09:17:10.8879188Z * [new branch] gh/IvanKobzarev/157/orig -> origin/gh/IvanKobzarev/157/orig 2025-12-04T09:17:10.8881968Z * [new branch] gh/IvanKobzarev/159/base -> origin/gh/IvanKobzarev/159/base 2025-12-04T09:17:10.8883811Z * [new branch] gh/IvanKobzarev/159/head -> origin/gh/IvanKobzarev/159/head 2025-12-04T09:17:10.8885681Z * [new branch] gh/IvanKobzarev/159/orig -> origin/gh/IvanKobzarev/159/orig 2025-12-04T09:17:10.8888432Z * [new branch] gh/IvanKobzarev/162/base -> origin/gh/IvanKobzarev/162/base 2025-12-04T09:17:10.8890473Z * [new branch] gh/IvanKobzarev/162/head -> origin/gh/IvanKobzarev/162/head 2025-12-04T09:17:10.8892331Z * [new branch] gh/IvanKobzarev/162/orig -> origin/gh/IvanKobzarev/162/orig 2025-12-04T09:17:10.8894903Z * [new branch] gh/IvanKobzarev/163/base -> origin/gh/IvanKobzarev/163/base 2025-12-04T09:17:10.8896733Z * [new branch] gh/IvanKobzarev/163/head -> origin/gh/IvanKobzarev/163/head 2025-12-04T09:17:10.8898537Z * [new branch] gh/IvanKobzarev/163/orig -> origin/gh/IvanKobzarev/163/orig 2025-12-04T09:17:10.8901474Z * [new branch] gh/IvanKobzarev/166/base -> origin/gh/IvanKobzarev/166/base 2025-12-04T09:17:10.8903248Z * [new branch] gh/IvanKobzarev/166/head -> origin/gh/IvanKobzarev/166/head 2025-12-04T09:17:10.8905099Z * [new branch] gh/IvanKobzarev/166/orig -> origin/gh/IvanKobzarev/166/orig 2025-12-04T09:17:10.8907757Z * [new branch] gh/IvanKobzarev/167/base -> 
origin/gh/IvanKobzarev/167/base 2025-12-04T09:17:10.8909479Z * [new branch] gh/IvanKobzarev/167/head -> origin/gh/IvanKobzarev/167/head 2025-12-04T09:17:10.8911316Z * [new branch] gh/IvanKobzarev/167/orig -> origin/gh/IvanKobzarev/167/orig 2025-12-04T09:17:10.8913809Z * [new branch] gh/IvanKobzarev/168/base -> origin/gh/IvanKobzarev/168/base 2025-12-04T09:17:10.8915764Z * [new branch] gh/IvanKobzarev/168/head -> origin/gh/IvanKobzarev/168/head 2025-12-04T09:17:10.8917455Z * [new branch] gh/IvanKobzarev/168/orig -> origin/gh/IvanKobzarev/168/orig 2025-12-04T09:17:10.8920172Z * [new branch] gh/IvanKobzarev/169/base -> origin/gh/IvanKobzarev/169/base 2025-12-04T09:17:10.8922011Z * [new branch] gh/IvanKobzarev/169/head -> origin/gh/IvanKobzarev/169/head 2025-12-04T09:17:10.8923857Z * [new branch] gh/IvanKobzarev/169/orig -> origin/gh/IvanKobzarev/169/orig 2025-12-04T09:17:10.8926474Z * [new branch] gh/IvanKobzarev/170/base -> origin/gh/IvanKobzarev/170/base 2025-12-04T09:17:10.8928415Z * [new branch] gh/IvanKobzarev/170/head -> origin/gh/IvanKobzarev/170/head 2025-12-04T09:17:10.8930228Z * [new branch] gh/IvanKobzarev/170/orig -> origin/gh/IvanKobzarev/170/orig 2025-12-04T09:17:10.8933026Z * [new branch] gh/IvanKobzarev/171/base -> origin/gh/IvanKobzarev/171/base 2025-12-04T09:17:10.8934884Z * [new branch] gh/IvanKobzarev/171/head -> origin/gh/IvanKobzarev/171/head 2025-12-04T09:17:10.8937156Z * [new branch] gh/IvanKobzarev/171/orig -> origin/gh/IvanKobzarev/171/orig 2025-12-04T09:17:10.8939738Z * [new branch] gh/IvanKobzarev/172/base -> origin/gh/IvanKobzarev/172/base 2025-12-04T09:17:10.8941665Z * [new branch] gh/IvanKobzarev/172/head -> origin/gh/IvanKobzarev/172/head 2025-12-04T09:17:10.8943579Z * [new branch] gh/IvanKobzarev/172/orig -> origin/gh/IvanKobzarev/172/orig 2025-12-04T09:17:10.8946146Z * [new branch] gh/IvanKobzarev/173/base -> origin/gh/IvanKobzarev/173/base 2025-12-04T09:17:10.8947988Z * [new branch] gh/IvanKobzarev/173/head -> 
origin/gh/IvanKobzarev/173/head 2025-12-04T09:17:10.8949805Z * [new branch] gh/IvanKobzarev/173/orig -> origin/gh/IvanKobzarev/173/orig 2025-12-04T09:17:10.8952488Z * [new branch] gh/IvanKobzarev/174/base -> origin/gh/IvanKobzarev/174/base 2025-12-04T09:17:10.8954384Z * [new branch] gh/IvanKobzarev/174/head -> origin/gh/IvanKobzarev/174/head 2025-12-04T09:17:10.8956226Z * [new branch] gh/IvanKobzarev/174/orig -> origin/gh/IvanKobzarev/174/orig 2025-12-04T09:17:10.8958817Z * [new branch] gh/IvanKobzarev/175/base -> origin/gh/IvanKobzarev/175/base 2025-12-04T09:17:10.8960830Z * [new branch] gh/IvanKobzarev/175/head -> origin/gh/IvanKobzarev/175/head 2025-12-04T09:17:10.8962730Z * [new branch] gh/IvanKobzarev/175/orig -> origin/gh/IvanKobzarev/175/orig 2025-12-04T09:17:10.8965444Z * [new branch] gh/IvanKobzarev/176/base -> origin/gh/IvanKobzarev/176/base 2025-12-04T09:17:10.8967312Z * [new branch] gh/IvanKobzarev/176/head -> origin/gh/IvanKobzarev/176/head 2025-12-04T09:17:10.8969138Z * [new branch] gh/IvanKobzarev/176/orig -> origin/gh/IvanKobzarev/176/orig 2025-12-04T09:17:10.8972194Z * [new branch] gh/IvanKobzarev/177/base -> origin/gh/IvanKobzarev/177/base 2025-12-04T09:17:10.8974076Z * [new branch] gh/IvanKobzarev/177/head -> origin/gh/IvanKobzarev/177/head 2025-12-04T09:17:10.8976208Z * [new branch] gh/IvanKobzarev/177/orig -> origin/gh/IvanKobzarev/177/orig 2025-12-04T09:17:10.8978897Z * [new branch] gh/IvanKobzarev/178/base -> origin/gh/IvanKobzarev/178/base 2025-12-04T09:17:10.8981063Z * [new branch] gh/IvanKobzarev/178/head -> origin/gh/IvanKobzarev/178/head 2025-12-04T09:17:10.8982855Z * [new branch] gh/IvanKobzarev/178/orig -> origin/gh/IvanKobzarev/178/orig 2025-12-04T09:17:10.8985418Z * [new branch] gh/IvanKobzarev/179/base -> origin/gh/IvanKobzarev/179/base 2025-12-04T09:17:10.8987309Z * [new branch] gh/IvanKobzarev/179/head -> origin/gh/IvanKobzarev/179/head 2025-12-04T09:17:10.8989343Z * [new branch] gh/IvanKobzarev/179/orig -> 
origin/gh/IvanKobzarev/179/orig 2025-12-04T09:17:10.8991792Z * [new branch] gh/IvanKobzarev/180/base -> origin/gh/IvanKobzarev/180/base 2025-12-04T09:17:10.8993747Z * [new branch] gh/IvanKobzarev/180/head -> origin/gh/IvanKobzarev/180/head 2025-12-04T09:17:10.8995272Z * [new branch] gh/IvanKobzarev/180/orig -> origin/gh/IvanKobzarev/180/orig 2025-12-04T09:17:10.8998553Z * [new branch] gh/IvanKobzarev/181/base -> origin/gh/IvanKobzarev/181/base 2025-12-04T09:17:10.9000816Z * [new branch] gh/IvanKobzarev/181/head -> origin/gh/IvanKobzarev/181/head 2025-12-04T09:17:10.9017393Z * [new branch] gh/IvanKobzarev/181/orig -> origin/gh/IvanKobzarev/181/orig 2025-12-04T09:17:10.9018052Z * [new branch] gh/IvanKobzarev/182/base -> origin/gh/IvanKobzarev/182/base 2025-12-04T09:17:10.9018662Z * [new branch] gh/IvanKobzarev/182/head -> origin/gh/IvanKobzarev/182/head 2025-12-04T09:17:10.9019209Z * [new branch] gh/IvanKobzarev/182/orig -> origin/gh/IvanKobzarev/182/orig 2025-12-04T09:17:10.9019755Z * [new branch] gh/IvanKobzarev/183/base -> origin/gh/IvanKobzarev/183/base 2025-12-04T09:17:10.9020311Z * [new branch] gh/IvanKobzarev/183/head -> origin/gh/IvanKobzarev/183/head 2025-12-04T09:17:10.9020859Z * [new branch] gh/IvanKobzarev/183/orig -> origin/gh/IvanKobzarev/183/orig 2025-12-04T09:17:10.9021411Z * [new branch] gh/IvanKobzarev/184/base -> origin/gh/IvanKobzarev/184/base 2025-12-04T09:17:10.9021965Z * [new branch] gh/IvanKobzarev/184/head -> origin/gh/IvanKobzarev/184/head 2025-12-04T09:17:10.9022505Z * [new branch] gh/IvanKobzarev/184/orig -> origin/gh/IvanKobzarev/184/orig 2025-12-04T09:17:10.9025172Z * [new branch] gh/NikhilAPatel/1/base -> origin/gh/NikhilAPatel/1/base 2025-12-04T09:17:10.9027068Z * [new branch] gh/NikhilAPatel/1/head -> origin/gh/NikhilAPatel/1/head 2025-12-04T09:17:10.9029489Z * [new branch] gh/NikhilAPatel/2/base -> origin/gh/NikhilAPatel/2/base 2025-12-04T09:17:10.9031263Z * [new branch] gh/NikhilAPatel/2/head -> origin/gh/NikhilAPatel/2/head 
2025-12-04T09:17:10.9033939Z * [new branch] gh/NikhilAPatel/4/base -> origin/gh/NikhilAPatel/4/base 2025-12-04T09:17:10.9035903Z * [new branch] gh/NikhilAPatel/4/head -> origin/gh/NikhilAPatel/4/head 2025-12-04T09:17:10.9038487Z * [new branch] gh/NikhilAPatel/5/base -> origin/gh/NikhilAPatel/5/base 2025-12-04T09:17:10.9040467Z * [new branch] gh/NikhilAPatel/5/head -> origin/gh/NikhilAPatel/5/head 2025-12-04T09:17:10.9042441Z * [new branch] gh/NikhilAPatel/5/orig -> origin/gh/NikhilAPatel/5/orig 2025-12-04T09:17:10.9045286Z * [new branch] gh/PaliC/17/base -> origin/gh/PaliC/17/base 2025-12-04T09:17:10.9047121Z * [new branch] gh/PaliC/17/head -> origin/gh/PaliC/17/head 2025-12-04T09:17:10.9048987Z * [new branch] gh/PaliC/17/orig -> origin/gh/PaliC/17/orig 2025-12-04T09:17:10.9051539Z * [new branch] gh/PaliC/18/base -> origin/gh/PaliC/18/base 2025-12-04T09:17:10.9053363Z * [new branch] gh/PaliC/18/head -> origin/gh/PaliC/18/head 2025-12-04T09:17:10.9055222Z * [new branch] gh/PaliC/18/orig -> origin/gh/PaliC/18/orig 2025-12-04T09:17:10.9057720Z * [new branch] gh/PaliC/20/base -> origin/gh/PaliC/20/base 2025-12-04T09:17:10.9059561Z * [new branch] gh/PaliC/20/head -> origin/gh/PaliC/20/head 2025-12-04T09:17:10.9061417Z * [new branch] gh/PaliC/20/orig -> origin/gh/PaliC/20/orig 2025-12-04T09:17:10.9063954Z * [new branch] gh/PaliC/21/base -> origin/gh/PaliC/21/base 2025-12-04T09:17:10.9065989Z * [new branch] gh/PaliC/21/head -> origin/gh/PaliC/21/head 2025-12-04T09:17:10.9067615Z * [new branch] gh/PaliC/21/orig -> origin/gh/PaliC/21/orig 2025-12-04T09:17:10.9070140Z * [new branch] gh/PaliC/23/base -> origin/gh/PaliC/23/base 2025-12-04T09:17:10.9071925Z * [new branch] gh/PaliC/23/head -> origin/gh/PaliC/23/head 2025-12-04T09:17:10.9073730Z * [new branch] gh/PaliC/23/orig -> origin/gh/PaliC/23/orig 2025-12-04T09:17:10.9076218Z * [new branch] gh/PaliC/24/base -> origin/gh/PaliC/24/base 2025-12-04T09:17:10.9078105Z * [new branch] gh/PaliC/24/head -> origin/gh/PaliC/24/head 
2025-12-04T09:17:10.9080067Z * [new branch] gh/PaliC/24/orig -> origin/gh/PaliC/24/orig 2025-12-04T09:17:10.9082560Z * [new branch] gh/PaliC/25/head -> origin/gh/PaliC/25/head 2025-12-04T09:17:10.9084395Z * [new branch] gh/PaliC/25/next -> origin/gh/PaliC/25/next 2025-12-04T09:17:10.9086243Z * [new branch] gh/PaliC/25/orig -> origin/gh/PaliC/25/orig 2025-12-04T09:17:10.9088778Z * [new branch] gh/PaliC/26/head -> origin/gh/PaliC/26/head 2025-12-04T09:17:10.9090506Z * [new branch] gh/PaliC/26/next -> origin/gh/PaliC/26/next 2025-12-04T09:17:10.9092268Z * [new branch] gh/PaliC/26/orig -> origin/gh/PaliC/26/orig 2025-12-04T09:17:10.9094881Z * [new branch] gh/PaliC/27/next -> origin/gh/PaliC/27/next 2025-12-04T09:17:10.9097318Z * [new branch] gh/PaliC/28/head -> origin/gh/PaliC/28/head 2025-12-04T09:17:10.9099154Z * [new branch] gh/PaliC/28/next -> origin/gh/PaliC/28/next 2025-12-04T09:17:10.9101166Z * [new branch] gh/PaliC/28/orig -> origin/gh/PaliC/28/orig 2025-12-04T09:17:10.9103707Z * [new branch] gh/PaliC/29/head -> origin/gh/PaliC/29/head 2025-12-04T09:17:10.9105444Z * [new branch] gh/PaliC/29/next -> origin/gh/PaliC/29/next 2025-12-04T09:17:10.9107269Z * [new branch] gh/PaliC/29/orig -> origin/gh/PaliC/29/orig 2025-12-04T09:17:10.9109920Z * [new branch] gh/PaliC/30/head -> origin/gh/PaliC/30/head 2025-12-04T09:17:10.9111621Z * [new branch] gh/PaliC/30/next -> origin/gh/PaliC/30/next 2025-12-04T09:17:10.9113487Z * [new branch] gh/PaliC/30/orig -> origin/gh/PaliC/30/orig 2025-12-04T09:17:10.9115961Z * [new branch] gh/PaliC/31/head -> origin/gh/PaliC/31/head 2025-12-04T09:17:10.9117689Z * [new branch] gh/PaliC/31/next -> origin/gh/PaliC/31/next 2025-12-04T09:17:10.9119605Z * [new branch] gh/PaliC/31/orig -> origin/gh/PaliC/31/orig 2025-12-04T09:17:10.9122687Z * [new branch] gh/PaulZhang12/25/base -> origin/gh/PaulZhang12/25/base 2025-12-04T09:17:10.9124665Z * [new branch] gh/PaulZhang12/25/head -> origin/gh/PaulZhang12/25/head 2025-12-04T09:17:10.9126511Z * [new 
branch] gh/PaulZhang12/25/orig -> origin/gh/PaulZhang12/25/orig 2025-12-04T09:17:10.9129084Z * [new branch] gh/PaulZhang12/28/base -> origin/gh/PaulZhang12/28/base 2025-12-04T09:17:10.9130941Z * [new branch] gh/PaulZhang12/28/head -> origin/gh/PaulZhang12/28/head 2025-12-04T09:17:10.9132793Z * [new branch] gh/PaulZhang12/28/orig -> origin/gh/PaulZhang12/28/orig 2025-12-04T09:17:10.9135616Z * [new branch] gh/PaulZhang12/31/base -> origin/gh/PaulZhang12/31/base 2025-12-04T09:17:10.9138480Z * [new branch] gh/PaulZhang12/31/head -> origin/gh/PaulZhang12/31/head 2025-12-04T09:17:10.9139769Z * [new branch] gh/PaulZhang12/31/orig -> origin/gh/PaulZhang12/31/orig 2025-12-04T09:17:10.9141591Z * [new branch] gh/PaulZhang12/37/base -> origin/gh/PaulZhang12/37/base 2025-12-04T09:17:10.9143541Z * [new branch] gh/PaulZhang12/37/head -> origin/gh/PaulZhang12/37/head 2025-12-04T09:17:10.9145290Z * [new branch] gh/PaulZhang12/37/orig -> origin/gh/PaulZhang12/37/orig 2025-12-04T09:17:10.9147969Z * [new branch] gh/PaulZhang12/40/base -> origin/gh/PaulZhang12/40/base 2025-12-04T09:17:10.9149768Z * [new branch] gh/PaulZhang12/40/head -> origin/gh/PaulZhang12/40/head 2025-12-04T09:17:10.9151585Z * [new branch] gh/PaulZhang12/40/orig -> origin/gh/PaulZhang12/40/orig 2025-12-04T09:17:10.9154132Z * [new branch] gh/PaulZhang12/42/base -> origin/gh/PaulZhang12/42/base 2025-12-04T09:17:10.9155965Z * [new branch] gh/PaulZhang12/42/head -> origin/gh/PaulZhang12/42/head 2025-12-04T09:17:10.9158550Z * [new branch] gh/PaulZhang12/43/base -> origin/gh/PaulZhang12/43/base 2025-12-04T09:17:10.9160560Z * [new branch] gh/PaulZhang12/43/head -> origin/gh/PaulZhang12/43/head 2025-12-04T09:17:10.9162374Z * [new branch] gh/PaulZhang12/43/orig -> origin/gh/PaulZhang12/43/orig 2025-12-04T09:17:10.9164778Z * [new branch] gh/PaulZhang12/44/base -> origin/gh/PaulZhang12/44/base 2025-12-04T09:17:10.9166732Z * [new branch] gh/PaulZhang12/44/head -> origin/gh/PaulZhang12/44/head 2025-12-04T09:17:10.9169620Z * [new 
branch] gh/PaulZhang12/45/base -> origin/gh/PaulZhang12/45/base 2025-12-04T09:17:10.9171410Z * [new branch] gh/PaulZhang12/45/head -> origin/gh/PaulZhang12/45/head 2025-12-04T09:17:10.9173226Z * [new branch] gh/PaulZhang12/45/orig -> origin/gh/PaulZhang12/45/orig 2025-12-04T09:17:10.9175813Z * [new branch] gh/PaulZhang12/46/base -> origin/gh/PaulZhang12/46/base 2025-12-04T09:17:10.9177834Z * [new branch] gh/PaulZhang12/46/head -> origin/gh/PaulZhang12/46/head 2025-12-04T09:17:10.9179738Z * [new branch] gh/PaulZhang12/46/orig -> origin/gh/PaulZhang12/46/orig 2025-12-04T09:17:10.9182338Z * [new branch] gh/PaulZhang12/47/base -> origin/gh/PaulZhang12/47/base 2025-12-04T09:17:10.9184186Z * [new branch] gh/PaulZhang12/47/head -> origin/gh/PaulZhang12/47/head 2025-12-04T09:17:10.9186260Z * [new branch] gh/PaulZhang12/47/orig -> origin/gh/PaulZhang12/47/orig 2025-12-04T09:17:10.9189075Z * [new branch] gh/PaulZhang12/48/base -> origin/gh/PaulZhang12/48/base 2025-12-04T09:17:10.9190941Z * [new branch] gh/PaulZhang12/48/head -> origin/gh/PaulZhang12/48/head 2025-12-04T09:17:10.9192752Z * [new branch] gh/PaulZhang12/48/orig -> origin/gh/PaulZhang12/48/orig 2025-12-04T09:17:10.9195785Z * [new branch] gh/SamGinzburg/11/base -> origin/gh/SamGinzburg/11/base 2025-12-04T09:17:10.9197754Z * [new branch] gh/SamGinzburg/11/head -> origin/gh/SamGinzburg/11/head 2025-12-04T09:17:10.9200984Z * [new branch] gh/SherlockNoMad/1/base -> origin/gh/SherlockNoMad/1/base 2025-12-04T09:17:10.9203083Z * [new branch] gh/SherlockNoMad/1/head -> origin/gh/SherlockNoMad/1/head 2025-12-04T09:17:10.9205662Z * [new branch] gh/SherlockNoMad/10/base -> origin/gh/SherlockNoMad/10/base 2025-12-04T09:17:10.9207612Z * [new branch] gh/SherlockNoMad/10/head -> origin/gh/SherlockNoMad/10/head 2025-12-04T09:17:10.9209492Z * [new branch] gh/SherlockNoMad/10/orig -> origin/gh/SherlockNoMad/10/orig 2025-12-04T09:17:10.9211913Z * [new branch] gh/SherlockNoMad/11/base -> origin/gh/SherlockNoMad/11/base 
2025-12-04T09:17:10.9213819Z * [new branch] gh/SherlockNoMad/11/head -> origin/gh/SherlockNoMad/11/head 2025-12-04T09:17:10.9215818Z * [new branch] gh/SherlockNoMad/11/orig -> origin/gh/SherlockNoMad/11/orig 2025-12-04T09:17:10.9218164Z * [new branch] gh/SherlockNoMad/12/base -> origin/gh/SherlockNoMad/12/base 2025-12-04T09:17:10.9219908Z * [new branch] gh/SherlockNoMad/12/head -> origin/gh/SherlockNoMad/12/head 2025-12-04T09:17:10.9221773Z * [new branch] gh/SherlockNoMad/12/orig -> origin/gh/SherlockNoMad/12/orig 2025-12-04T09:17:10.9224351Z * [new branch] gh/SherlockNoMad/15/base -> origin/gh/SherlockNoMad/15/base 2025-12-04T09:17:10.9226222Z * [new branch] gh/SherlockNoMad/15/head -> origin/gh/SherlockNoMad/15/head 2025-12-04T09:17:10.9228098Z * [new branch] gh/SherlockNoMad/15/orig -> origin/gh/SherlockNoMad/15/orig 2025-12-04T09:17:10.9230594Z * [new branch] gh/SherlockNoMad/17/base -> origin/gh/SherlockNoMad/17/base 2025-12-04T09:17:10.9232412Z * [new branch] gh/SherlockNoMad/17/head -> origin/gh/SherlockNoMad/17/head 2025-12-04T09:17:10.9234246Z * [new branch] gh/SherlockNoMad/17/orig -> origin/gh/SherlockNoMad/17/orig 2025-12-04T09:17:10.9236970Z * [new branch] gh/SherlockNoMad/18/base -> origin/gh/SherlockNoMad/18/base 2025-12-04T09:17:10.9238867Z * [new branch] gh/SherlockNoMad/18/head -> origin/gh/SherlockNoMad/18/head 2025-12-04T09:17:10.9240839Z * [new branch] gh/SherlockNoMad/18/orig -> origin/gh/SherlockNoMad/18/orig 2025-12-04T09:17:10.9243232Z * [new branch] gh/SherlockNoMad/19/base -> origin/gh/SherlockNoMad/19/base 2025-12-04T09:17:10.9245111Z * [new branch] gh/SherlockNoMad/19/head -> origin/gh/SherlockNoMad/19/head 2025-12-04T09:17:10.9246981Z * [new branch] gh/SherlockNoMad/19/orig -> origin/gh/SherlockNoMad/19/orig 2025-12-04T09:17:10.9249407Z * [new branch] gh/SherlockNoMad/2/base -> origin/gh/SherlockNoMad/2/base 2025-12-04T09:17:10.9251213Z * [new branch] gh/SherlockNoMad/2/head -> origin/gh/SherlockNoMad/2/head 
2025-12-04T09:17:10.9253517Z * [new branch] gh/SherlockNoMad/20/base -> origin/gh/SherlockNoMad/20/base 2025-12-04T09:17:10.9255535Z * [new branch] gh/SherlockNoMad/20/head -> origin/gh/SherlockNoMad/20/head 2025-12-04T09:17:10.9257330Z * [new branch] gh/SherlockNoMad/20/orig -> origin/gh/SherlockNoMad/20/orig 2025-12-04T09:17:10.9260002Z * [new branch] gh/SherlockNoMad/21/base -> origin/gh/SherlockNoMad/21/base 2025-12-04T09:17:10.9262004Z * [new branch] gh/SherlockNoMad/21/head -> origin/gh/SherlockNoMad/21/head 2025-12-04T09:17:10.9263726Z * [new branch] gh/SherlockNoMad/21/orig -> origin/gh/SherlockNoMad/21/orig 2025-12-04T09:17:10.9266188Z * [new branch] gh/SherlockNoMad/3/base -> origin/gh/SherlockNoMad/3/base 2025-12-04T09:17:10.9268031Z * [new branch] gh/SherlockNoMad/3/head -> origin/gh/SherlockNoMad/3/head 2025-12-04T09:17:10.9270379Z * [new branch] gh/SherlockNoMad/4/base -> origin/gh/SherlockNoMad/4/base 2025-12-04T09:17:10.9272121Z * [new branch] gh/SherlockNoMad/4/head -> origin/gh/SherlockNoMad/4/head 2025-12-04T09:17:10.9274479Z * [new branch] gh/SherlockNoMad/5/base -> origin/gh/SherlockNoMad/5/base 2025-12-04T09:17:10.9276219Z * [new branch] gh/SherlockNoMad/5/head -> origin/gh/SherlockNoMad/5/head 2025-12-04T09:17:10.9280079Z * [new branch] gh/Sidharth123-cpu/24/base -> origin/gh/Sidharth123-cpu/24/base 2025-12-04T09:17:10.9282516Z * [new branch] gh/Sidharth123-cpu/25/base -> origin/gh/Sidharth123-cpu/25/base 2025-12-04T09:17:10.9284877Z * [new branch] gh/Sidharth123-cpu/26/base -> origin/gh/Sidharth123-cpu/26/base 2025-12-04T09:17:10.9287589Z * [new branch] gh/Sidharth123-cpu/27/base -> origin/gh/Sidharth123-cpu/27/base 2025-12-04T09:17:10.9290904Z * [new branch] gh/StrongerXi/1/base -> origin/gh/StrongerXi/1/base 2025-12-04T09:17:10.9292617Z * [new branch] gh/StrongerXi/1/head -> origin/gh/StrongerXi/1/head 2025-12-04T09:17:10.9295083Z * [new branch] gh/StrongerXi/71/base -> origin/gh/StrongerXi/71/base 2025-12-04T09:17:10.9296948Z * [new 
branch] gh/StrongerXi/71/head -> origin/gh/StrongerXi/71/head 2025-12-04T09:17:10.9299351Z * [new branch] gh/StrongerXi/72/base -> origin/gh/StrongerXi/72/base 2025-12-04T09:17:10.9301189Z * [new branch] gh/StrongerXi/72/head -> origin/gh/StrongerXi/72/head 2025-12-04T09:17:10.9303975Z * [new branch] gh/StrongerXi/73/base -> origin/gh/StrongerXi/73/base 2025-12-04T09:17:10.9305825Z * [new branch] gh/StrongerXi/73/head -> origin/gh/StrongerXi/73/head 2025-12-04T09:17:10.9307948Z * [new branch] gh/StrongerXi/73/orig -> origin/gh/StrongerXi/73/orig 2025-12-04T09:17:10.9311314Z * [new branch] gh/XilunWu/160/base -> origin/gh/XilunWu/160/base 2025-12-04T09:17:10.9313052Z * [new branch] gh/XilunWu/160/head -> origin/gh/XilunWu/160/head 2025-12-04T09:17:10.9314880Z * [new branch] gh/XilunWu/160/orig -> origin/gh/XilunWu/160/orig 2025-12-04T09:17:10.9317439Z * [new branch] gh/XilunWu/163/base -> origin/gh/XilunWu/163/base 2025-12-04T09:17:10.9319384Z * [new branch] gh/XilunWu/163/head -> origin/gh/XilunWu/163/head 2025-12-04T09:17:10.9321446Z * [new branch] gh/XilunWu/163/orig -> origin/gh/XilunWu/163/orig 2025-12-04T09:17:10.9324041Z * [new branch] gh/XilunWu/168/base -> origin/gh/XilunWu/168/base 2025-12-04T09:17:10.9325900Z * [new branch] gh/XilunWu/168/head -> origin/gh/XilunWu/168/head 2025-12-04T09:17:10.9327627Z * [new branch] gh/XilunWu/168/orig -> origin/gh/XilunWu/168/orig 2025-12-04T09:17:10.9330138Z * [new branch] gh/XilunWu/169/base -> origin/gh/XilunWu/169/base 2025-12-04T09:17:10.9331981Z * [new branch] gh/XilunWu/169/head -> origin/gh/XilunWu/169/head 2025-12-04T09:17:10.9333844Z * [new branch] gh/XilunWu/169/orig -> origin/gh/XilunWu/169/orig 2025-12-04T09:17:10.9336131Z * [new branch] gh/XilunWu/170/base -> origin/gh/XilunWu/170/base 2025-12-04T09:17:10.9338415Z * [new branch] gh/XilunWu/170/head -> origin/gh/XilunWu/170/head 2025-12-04T09:17:10.9339671Z * [new branch] gh/XilunWu/170/orig -> origin/gh/XilunWu/170/orig 2025-12-04T09:17:10.9342552Z * [new 
branch] gh/XilunWu/171/base -> origin/gh/XilunWu/171/base 2025-12-04T09:17:10.9344454Z * [new branch] gh/XilunWu/171/head -> origin/gh/XilunWu/171/head 2025-12-04T09:17:10.9346387Z * [new branch] gh/XilunWu/171/orig -> origin/gh/XilunWu/171/orig 2025-12-04T09:17:10.9348875Z * [new branch] gh/XilunWu/173/base -> origin/gh/XilunWu/173/base 2025-12-04T09:17:10.9350771Z * [new branch] gh/XilunWu/173/head -> origin/gh/XilunWu/173/head 2025-12-04T09:17:10.9352579Z * [new branch] gh/XilunWu/173/orig -> origin/gh/XilunWu/173/orig 2025-12-04T09:17:10.9355079Z * [new branch] gh/XilunWu/175/base -> origin/gh/XilunWu/175/base 2025-12-04T09:17:10.9356961Z * [new branch] gh/XilunWu/175/head -> origin/gh/XilunWu/175/head 2025-12-04T09:17:10.9358857Z * [new branch] gh/XilunWu/175/orig -> origin/gh/XilunWu/175/orig 2025-12-04T09:17:10.9361626Z * [new branch] gh/XilunWu/176/base -> origin/gh/XilunWu/176/base 2025-12-04T09:17:10.9363453Z * [new branch] gh/XilunWu/176/head -> origin/gh/XilunWu/176/head 2025-12-04T09:17:10.9365513Z * [new branch] gh/XilunWu/176/orig -> origin/gh/XilunWu/176/orig 2025-12-04T09:17:10.9368613Z * [new branch] gh/XuehaiPan/14/base -> origin/gh/XuehaiPan/14/base 2025-12-04T09:17:10.9370348Z * [new branch] gh/XuehaiPan/14/head -> origin/gh/XuehaiPan/14/head 2025-12-04T09:17:10.9372195Z * [new branch] gh/XuehaiPan/14/orig -> origin/gh/XuehaiPan/14/orig 2025-12-04T09:17:10.9374785Z * [new branch] gh/XuehaiPan/179/base -> origin/gh/XuehaiPan/179/base 2025-12-04T09:17:10.9376577Z * [new branch] gh/XuehaiPan/179/head -> origin/gh/XuehaiPan/179/head 2025-12-04T09:17:10.9378566Z * [new branch] gh/XuehaiPan/179/orig -> origin/gh/XuehaiPan/179/orig 2025-12-04T09:17:10.9381109Z * [new branch] gh/XuehaiPan/249/base -> origin/gh/XuehaiPan/249/base 2025-12-04T09:17:10.9382949Z * [new branch] gh/XuehaiPan/249/head -> origin/gh/XuehaiPan/249/head 2025-12-04T09:17:10.9385120Z * [new branch] gh/XuehaiPan/249/orig -> origin/gh/XuehaiPan/249/orig 2025-12-04T09:17:10.9387449Z * 
[new branch] gh/XuehaiPan/253/base -> origin/gh/XuehaiPan/253/base 2025-12-04T09:17:10.9389246Z * [new branch] gh/XuehaiPan/253/head -> origin/gh/XuehaiPan/253/head 2025-12-04T09:17:10.9391105Z * [new branch] gh/XuehaiPan/253/orig -> origin/gh/XuehaiPan/253/orig 2025-12-04T09:17:10.9393662Z * [new branch] gh/XuehaiPan/254/base -> origin/gh/XuehaiPan/254/base 2025-12-04T09:17:10.9395565Z * [new branch] gh/XuehaiPan/254/head -> origin/gh/XuehaiPan/254/head 2025-12-04T09:17:10.9397422Z * [new branch] gh/XuehaiPan/254/orig -> origin/gh/XuehaiPan/254/orig 2025-12-04T09:17:10.9399929Z * [new branch] gh/XuehaiPan/255/base -> origin/gh/XuehaiPan/255/base 2025-12-04T09:17:10.9402086Z * [new branch] gh/XuehaiPan/255/head -> origin/gh/XuehaiPan/255/head 2025-12-04T09:17:10.9403861Z * [new branch] gh/XuehaiPan/255/orig -> origin/gh/XuehaiPan/255/orig 2025-12-04T09:17:10.9406427Z * [new branch] gh/XuehaiPan/271/base -> origin/gh/XuehaiPan/271/base 2025-12-04T09:17:10.9408126Z * [new branch] gh/XuehaiPan/271/head -> origin/gh/XuehaiPan/271/head 2025-12-04T09:17:10.9410069Z * [new branch] gh/XuehaiPan/271/orig -> origin/gh/XuehaiPan/271/orig 2025-12-04T09:17:10.9412600Z * [new branch] gh/XuehaiPan/343/base -> origin/gh/XuehaiPan/343/base 2025-12-04T09:17:10.9414508Z * [new branch] gh/XuehaiPan/343/head -> origin/gh/XuehaiPan/343/head 2025-12-04T09:17:10.9416371Z * [new branch] gh/XuehaiPan/343/orig -> origin/gh/XuehaiPan/343/orig 2025-12-04T09:17:10.9418931Z * [new branch] gh/XuehaiPan/347/base -> origin/gh/XuehaiPan/347/base 2025-12-04T09:17:10.9420805Z * [new branch] gh/XuehaiPan/347/head -> origin/gh/XuehaiPan/347/head 2025-12-04T09:17:10.9422581Z * [new branch] gh/XuehaiPan/347/orig -> origin/gh/XuehaiPan/347/orig 2025-12-04T09:17:10.9425074Z * [new branch] gh/XuehaiPan/348/base -> origin/gh/XuehaiPan/348/base 2025-12-04T09:17:10.9426850Z * [new branch] gh/XuehaiPan/348/head -> origin/gh/XuehaiPan/348/head 2025-12-04T09:17:10.9428787Z * [new branch] gh/XuehaiPan/348/orig -> 
origin/gh/XuehaiPan/348/orig 2025-12-04T09:17:10.9431339Z * [new branch] gh/XuehaiPan/350/base -> origin/gh/XuehaiPan/350/base 2025-12-04T09:17:10.9433143Z * [new branch] gh/XuehaiPan/350/head -> origin/gh/XuehaiPan/350/head 2025-12-04T09:17:10.9435048Z * [new branch] gh/XuehaiPan/350/orig -> origin/gh/XuehaiPan/350/orig 2025-12-04T09:17:10.9437685Z * [new branch] gh/XuehaiPan/365/base -> origin/gh/XuehaiPan/365/base 2025-12-04T09:17:10.9439521Z * [new branch] gh/XuehaiPan/365/head -> origin/gh/XuehaiPan/365/head 2025-12-04T09:17:10.9441501Z * [new branch] gh/XuehaiPan/365/orig -> origin/gh/XuehaiPan/365/orig 2025-12-04T09:17:10.9444033Z * [new branch] gh/XuehaiPan/366/base -> origin/gh/XuehaiPan/366/base 2025-12-04T09:17:10.9445879Z * [new branch] gh/XuehaiPan/366/head -> origin/gh/XuehaiPan/366/head 2025-12-04T09:17:10.9448415Z * [new branch] gh/XuehaiPan/370/base -> origin/gh/XuehaiPan/370/base 2025-12-04T09:17:10.9450235Z * [new branch] gh/XuehaiPan/370/head -> origin/gh/XuehaiPan/370/head 2025-12-04T09:17:10.9452071Z * [new branch] gh/XuehaiPan/370/orig -> origin/gh/XuehaiPan/370/orig 2025-12-04T09:17:10.9454665Z * [new branch] gh/XuehaiPan/390/base -> origin/gh/XuehaiPan/390/base 2025-12-04T09:17:10.9456468Z * [new branch] gh/XuehaiPan/390/head -> origin/gh/XuehaiPan/390/head 2025-12-04T09:17:10.9458411Z * [new branch] gh/XuehaiPan/390/orig -> origin/gh/XuehaiPan/390/orig 2025-12-04T09:17:10.9460893Z * [new branch] gh/XuehaiPan/391/base -> origin/gh/XuehaiPan/391/base 2025-12-04T09:17:10.9462745Z * [new branch] gh/XuehaiPan/391/head -> origin/gh/XuehaiPan/391/head 2025-12-04T09:17:10.9464583Z * [new branch] gh/XuehaiPan/391/orig -> origin/gh/XuehaiPan/391/orig 2025-12-04T09:17:10.9467021Z * [new branch] gh/XuehaiPan/392/base -> origin/gh/XuehaiPan/392/base 2025-12-04T09:17:10.9468921Z * [new branch] gh/XuehaiPan/392/head -> origin/gh/XuehaiPan/392/head 2025-12-04T09:17:10.9470684Z * [new branch] gh/XuehaiPan/392/orig -> origin/gh/XuehaiPan/392/orig 
2025-12-04T09:17:10.9473785Z * [new branch] gh/XuehaiPan/394/base -> origin/gh/XuehaiPan/394/base 2025-12-04T09:17:10.9475655Z * [new branch] gh/XuehaiPan/394/head -> origin/gh/XuehaiPan/394/head 2025-12-04T09:17:10.9477517Z * [new branch] gh/XuehaiPan/394/orig -> origin/gh/XuehaiPan/394/orig 2025-12-04T09:17:10.9480285Z * [new branch] gh/XuehaiPan/397/base -> origin/gh/XuehaiPan/397/base 2025-12-04T09:17:10.9482037Z * [new branch] gh/XuehaiPan/397/head -> origin/gh/XuehaiPan/397/head 2025-12-04T09:17:10.9483907Z * [new branch] gh/XuehaiPan/397/orig -> origin/gh/XuehaiPan/397/orig 2025-12-04T09:17:10.9486513Z * [new branch] gh/XuehaiPan/398/base -> origin/gh/XuehaiPan/398/base 2025-12-04T09:17:10.9488322Z * [new branch] gh/XuehaiPan/398/head -> origin/gh/XuehaiPan/398/head 2025-12-04T09:17:10.9490194Z * [new branch] gh/XuehaiPan/398/orig -> origin/gh/XuehaiPan/398/orig 2025-12-04T09:17:10.9492702Z * [new branch] gh/XuehaiPan/399/base -> origin/gh/XuehaiPan/399/base 2025-12-04T09:17:10.9494585Z * [new branch] gh/XuehaiPan/399/head -> origin/gh/XuehaiPan/399/head 2025-12-04T09:17:10.9496445Z * [new branch] gh/XuehaiPan/399/orig -> origin/gh/XuehaiPan/399/orig 2025-12-04T09:17:10.9499117Z * [new branch] gh/XuehaiPan/400/base -> origin/gh/XuehaiPan/400/base 2025-12-04T09:17:10.9501106Z * [new branch] gh/XuehaiPan/400/head -> origin/gh/XuehaiPan/400/head 2025-12-04T09:17:10.9503072Z * [new branch] gh/XuehaiPan/400/orig -> origin/gh/XuehaiPan/400/orig 2025-12-04T09:17:10.9506196Z * [new branch] gh/ZhiweiYan-96/39/base -> origin/gh/ZhiweiYan-96/39/base 2025-12-04T09:17:10.9508194Z * [new branch] gh/ZhiweiYan-96/39/head -> origin/gh/ZhiweiYan-96/39/head 2025-12-04T09:17:10.9509854Z * [new branch] gh/ZhiweiYan-96/39/orig -> origin/gh/ZhiweiYan-96/39/orig 2025-12-04T09:17:10.9512812Z * [new branch] gh/ZhiweiYan-96/44/base -> origin/gh/ZhiweiYan-96/44/base 2025-12-04T09:17:10.9514064Z * [new branch] gh/ZhiweiYan-96/44/head -> origin/gh/ZhiweiYan-96/44/head 
2025-12-04T09:17:10.9516674Z * [new branch] gh/ZhiweiYan-96/45/base -> origin/gh/ZhiweiYan-96/45/base 2025-12-04T09:17:10.9518480Z * [new branch] gh/ZhiweiYan-96/45/head -> origin/gh/ZhiweiYan-96/45/head 2025-12-04T09:17:10.9521332Z * [new branch] gh/ZhiweiYan-96/49/base -> origin/gh/ZhiweiYan-96/49/base 2025-12-04T09:17:10.9523072Z * [new branch] gh/ZhiweiYan-96/49/head -> origin/gh/ZhiweiYan-96/49/head 2025-12-04T09:17:10.9525622Z * [new branch] gh/ZhiweiYan-96/62/base -> origin/gh/ZhiweiYan-96/62/base 2025-12-04T09:17:10.9527459Z * [new branch] gh/ZhiweiYan-96/62/head -> origin/gh/ZhiweiYan-96/62/head 2025-12-04T09:17:10.9530124Z * [new branch] gh/ZhiweiYan-96/66/base -> origin/gh/ZhiweiYan-96/66/base 2025-12-04T09:17:10.9531984Z * [new branch] gh/ZhiweiYan-96/66/head -> origin/gh/ZhiweiYan-96/66/head 2025-12-04T09:17:10.9534327Z * [new branch] gh/ZhiweiYan-96/67/base -> origin/gh/ZhiweiYan-96/67/base 2025-12-04T09:17:10.9536225Z * [new branch] gh/ZhiweiYan-96/67/head -> origin/gh/ZhiweiYan-96/67/head 2025-12-04T09:17:10.9538619Z * [new branch] gh/ZhiweiYan-96/68/base -> origin/gh/ZhiweiYan-96/68/base 2025-12-04T09:17:10.9540416Z * [new branch] gh/ZhiweiYan-96/68/head -> origin/gh/ZhiweiYan-96/68/head 2025-12-04T09:17:10.9542342Z * [new branch] gh/ZhiweiYan-96/68/orig -> origin/gh/ZhiweiYan-96/68/orig 2025-12-04T09:17:10.9545389Z * [new branch] gh/aakhundov/1/base -> origin/gh/aakhundov/1/base 2025-12-04T09:17:10.9547354Z * [new branch] gh/aakhundov/1/head -> origin/gh/aakhundov/1/head 2025-12-04T09:17:10.9549748Z * [new branch] gh/aakhundov/2/base -> origin/gh/aakhundov/2/base 2025-12-04T09:17:10.9551664Z * [new branch] gh/aakhundov/2/head -> origin/gh/aakhundov/2/head 2025-12-04T09:17:10.9554053Z * [new branch] gh/aditew01/openblas -> origin/gh/aditew01/openblas 2025-12-04T09:17:10.9555931Z * [new branch] gh/aditew01/sbgemm -> origin/gh/aditew01/sbgemm 2025-12-04T09:17:10.9557833Z * [new branch] gh/aditew01/vecbf16 -> origin/gh/aditew01/vecbf16 
2025-12-04T09:17:10.9560878Z * [new branch] gh/albanD/4/base -> origin/gh/albanD/4/base 2025-12-04T09:17:10.9562765Z * [new branch] gh/albanD/4/head -> origin/gh/albanD/4/head 2025-12-04T09:17:10.9564600Z * [new branch] gh/albanD/4/orig -> origin/gh/albanD/4/orig 2025-12-04T09:17:10.9567543Z * [new branch] gh/alexbrauckmann/paddedtensor_faketensor_init -> origin/gh/alexbrauckmann/paddedtensor_faketensor_init 2025-12-04T09:17:10.9570343Z * [new branch] gh/alexsamardzic/12/base -> origin/gh/alexsamardzic/12/base 2025-12-04T09:17:10.9572361Z * [new branch] gh/alexsamardzic/12/head -> origin/gh/alexsamardzic/12/head 2025-12-04T09:17:10.9574008Z * [new branch] gh/alexsamardzic/12/orig -> origin/gh/alexsamardzic/12/orig 2025-12-04T09:17:10.9576561Z * [new branch] gh/alexsamardzic/14/base -> origin/gh/alexsamardzic/14/base 2025-12-04T09:17:10.9578435Z * [new branch] gh/alexsamardzic/14/head -> origin/gh/alexsamardzic/14/head 2025-12-04T09:17:10.9580188Z * [new branch] gh/alexsamardzic/14/orig -> origin/gh/alexsamardzic/14/orig 2025-12-04T09:17:10.9582719Z * [new branch] gh/alexsamardzic/15/base -> origin/gh/alexsamardzic/15/base 2025-12-04T09:17:10.9584595Z * [new branch] gh/alexsamardzic/15/head -> origin/gh/alexsamardzic/15/head 2025-12-04T09:17:10.9586567Z * [new branch] gh/alexsamardzic/15/orig -> origin/gh/alexsamardzic/15/orig 2025-12-04T09:17:10.9589569Z * [new branch] gh/amjames/18/base -> origin/gh/amjames/18/base 2025-12-04T09:17:10.9591675Z * [new branch] gh/amjames/18/head -> origin/gh/amjames/18/head 2025-12-04T09:17:10.9593551Z * [new branch] gh/amjames/18/orig -> origin/gh/amjames/18/orig 2025-12-04T09:17:10.9596626Z * [new branch] gh/andrewor14/35/base -> origin/gh/andrewor14/35/base 2025-12-04T09:17:10.9598704Z * [new branch] gh/andrewor14/35/head -> origin/gh/andrewor14/35/head 2025-12-04T09:17:10.9600655Z * [new branch] gh/andrewor14/35/orig -> origin/gh/andrewor14/35/orig 2025-12-04T09:17:10.9605273Z * [new branch] gh/andrewor14/50/base -> 
origin/gh/andrewor14/50/base 2025-12-04T09:17:10.9607171Z * [new branch] gh/andrewor14/50/head -> origin/gh/andrewor14/50/head 2025-12-04T09:17:10.9609061Z * [new branch] gh/andrewor14/50/orig -> origin/gh/andrewor14/50/orig 2025-12-04T09:17:10.9612266Z * [new branch] gh/andyanwang/30/base -> origin/gh/andyanwang/30/base 2025-12-04T09:17:10.9614268Z * [new branch] gh/andyanwang/30/orig -> origin/gh/andyanwang/30/orig 2025-12-04T09:17:10.9616874Z * [new branch] gh/andyanwang/31/base -> origin/gh/andyanwang/31/base 2025-12-04T09:17:10.9618979Z * [new branch] gh/andyanwang/31/orig -> origin/gh/andyanwang/31/orig 2025-12-04T09:17:10.9621709Z * [new branch] gh/andyanwang/39/base -> origin/gh/andyanwang/39/base 2025-12-04T09:17:10.9623414Z * [new branch] gh/andyanwang/39/head -> origin/gh/andyanwang/39/head 2025-12-04T09:17:10.9625250Z * [new branch] gh/andyanwang/39/orig -> origin/gh/andyanwang/39/orig 2025-12-04T09:17:10.9628013Z * [new branch] gh/andyanwang/42/base -> origin/gh/andyanwang/42/base 2025-12-04T09:17:10.9629768Z * [new branch] gh/andyanwang/42/head -> origin/gh/andyanwang/42/head 2025-12-04T09:17:10.9631690Z * [new branch] gh/andyanwang/42/orig -> origin/gh/andyanwang/42/orig 2025-12-04T09:17:10.9634350Z * [new branch] gh/andyanwang/45/base -> origin/gh/andyanwang/45/base 2025-12-04T09:17:10.9636263Z * [new branch] gh/andyanwang/45/head -> origin/gh/andyanwang/45/head 2025-12-04T09:17:10.9638098Z * [new branch] gh/andyanwang/45/orig -> origin/gh/andyanwang/45/orig 2025-12-04T09:17:10.9641461Z * [new branch] gh/angelayi/107/base -> origin/gh/angelayi/107/base 2025-12-04T09:17:10.9643254Z * [new branch] gh/angelayi/107/head -> origin/gh/angelayi/107/head 2025-12-04T09:17:10.9645920Z * [new branch] gh/angelayi/114/base -> origin/gh/angelayi/114/base 2025-12-04T09:17:10.9648074Z * [new branch] gh/angelayi/114/head -> origin/gh/angelayi/114/head 2025-12-04T09:17:10.9649733Z * [new branch] gh/angelayi/114/orig -> origin/gh/angelayi/114/orig 
2025-12-04T09:17:10.9652234Z * [new branch] gh/angelayi/116/base -> origin/gh/angelayi/116/base 2025-12-04T09:17:10.9654088Z * [new branch] gh/angelayi/116/head -> origin/gh/angelayi/116/head 2025-12-04T09:17:10.9655905Z * [new branch] gh/angelayi/116/orig -> origin/gh/angelayi/116/orig 2025-12-04T09:17:10.9658607Z * [new branch] gh/angelayi/122/base -> origin/gh/angelayi/122/base 2025-12-04T09:17:10.9660432Z * [new branch] gh/angelayi/122/head -> origin/gh/angelayi/122/head 2025-12-04T09:17:10.9662307Z * [new branch] gh/angelayi/122/orig -> origin/gh/angelayi/122/orig 2025-12-04T09:17:10.9664939Z * [new branch] gh/angelayi/124/base -> origin/gh/angelayi/124/base 2025-12-04T09:17:10.9666826Z * [new branch] gh/angelayi/124/head -> origin/gh/angelayi/124/head 2025-12-04T09:17:10.9668484Z * [new branch] gh/angelayi/124/orig -> origin/gh/angelayi/124/orig 2025-12-04T09:17:10.9671594Z * [new branch] gh/angelayi/128/base -> origin/gh/angelayi/128/base 2025-12-04T09:17:10.9673271Z * [new branch] gh/angelayi/128/head -> origin/gh/angelayi/128/head 2025-12-04T09:17:10.9675195Z * [new branch] gh/angelayi/128/orig -> origin/gh/angelayi/128/orig 2025-12-04T09:17:10.9677779Z * [new branch] gh/angelayi/131/base -> origin/gh/angelayi/131/base 2025-12-04T09:17:10.9679798Z * [new branch] gh/angelayi/131/head -> origin/gh/angelayi/131/head 2025-12-04T09:17:10.9682025Z * [new branch] gh/angelayi/131/orig -> origin/gh/angelayi/131/orig 2025-12-04T09:17:10.9684910Z * [new branch] gh/angelayi/132/base -> origin/gh/angelayi/132/base 2025-12-04T09:17:10.9687006Z * [new branch] gh/angelayi/132/head -> origin/gh/angelayi/132/head 2025-12-04T09:17:10.9688983Z * [new branch] gh/angelayi/132/orig -> origin/gh/angelayi/132/orig 2025-12-04T09:17:10.9691500Z * [new branch] gh/angelayi/133/base -> origin/gh/angelayi/133/base 2025-12-04T09:17:10.9693293Z * [new branch] gh/angelayi/133/head -> origin/gh/angelayi/133/head 2025-12-04T09:17:10.9695191Z * [new branch] gh/angelayi/133/orig -> 
origin/gh/angelayi/133/orig 2025-12-04T09:17:10.9697877Z * [new branch] gh/angelayi/134/base -> origin/gh/angelayi/134/base 2025-12-04T09:17:10.9699837Z * [new branch] gh/angelayi/134/head -> origin/gh/angelayi/134/head 2025-12-04T09:17:10.9701971Z * [new branch] gh/angelayi/134/orig -> origin/gh/angelayi/134/orig 2025-12-04T09:17:10.9704661Z * [new branch] gh/angelayi/135/base -> origin/gh/angelayi/135/base 2025-12-04T09:17:10.9706588Z * [new branch] gh/angelayi/135/head -> origin/gh/angelayi/135/head 2025-12-04T09:17:10.9708468Z * [new branch] gh/angelayi/135/orig -> origin/gh/angelayi/135/orig 2025-12-04T09:17:10.9711026Z * [new branch] gh/angelayi/136/base -> origin/gh/angelayi/136/base 2025-12-04T09:17:10.9712802Z * [new branch] gh/angelayi/136/head -> origin/gh/angelayi/136/head 2025-12-04T09:17:10.9714607Z * [new branch] gh/angelayi/136/orig -> origin/gh/angelayi/136/orig 2025-12-04T09:17:10.9717338Z * [new branch] gh/angelayi/137/base -> origin/gh/angelayi/137/base 2025-12-04T09:17:10.9719144Z * [new branch] gh/angelayi/137/head -> origin/gh/angelayi/137/head 2025-12-04T09:17:10.9721177Z * [new branch] gh/angelayi/137/orig -> origin/gh/angelayi/137/orig 2025-12-04T09:17:10.9723601Z * [new branch] gh/angelayi/138/base -> origin/gh/angelayi/138/base 2025-12-04T09:17:10.9725388Z * [new branch] gh/angelayi/138/head -> origin/gh/angelayi/138/head 2025-12-04T09:17:10.9727093Z * [new branch] gh/angelayi/138/orig -> origin/gh/angelayi/138/orig 2025-12-04T09:17:10.9729673Z * [new branch] gh/angelayi/139/base -> origin/gh/angelayi/139/base 2025-12-04T09:17:10.9731550Z * [new branch] gh/angelayi/139/head -> origin/gh/angelayi/139/head 2025-12-04T09:17:10.9733397Z * [new branch] gh/angelayi/139/orig -> origin/gh/angelayi/139/orig 2025-12-04T09:17:10.9736025Z * [new branch] gh/angelayi/140/base -> origin/gh/angelayi/140/base 2025-12-04T09:17:10.9737637Z * [new branch] gh/angelayi/140/head -> origin/gh/angelayi/140/head 2025-12-04T09:17:10.9739777Z * [new branch] 
gh/angelayi/140/orig -> origin/gh/angelayi/140/orig 2025-12-04T09:17:10.9743056Z * [new branch] gh/angelayi/141/base -> origin/gh/angelayi/141/base 2025-12-04T09:17:10.9744728Z * [new branch] gh/angelayi/141/head -> origin/gh/angelayi/141/head 2025-12-04T09:17:10.9746642Z * [new branch] gh/angelayi/141/orig -> origin/gh/angelayi/141/orig 2025-12-04T09:17:10.9749215Z * [new branch] gh/angelayi/142/base -> origin/gh/angelayi/142/base 2025-12-04T09:17:10.9751075Z * [new branch] gh/angelayi/142/head -> origin/gh/angelayi/142/head 2025-12-04T09:17:10.9752984Z * [new branch] gh/angelayi/142/orig -> origin/gh/angelayi/142/orig 2025-12-04T09:17:10.9755585Z * [new branch] gh/angelayi/143/base -> origin/gh/angelayi/143/base 2025-12-04T09:17:10.9757387Z * [new branch] gh/angelayi/143/head -> origin/gh/angelayi/143/head 2025-12-04T09:17:10.9759159Z * [new branch] gh/angelayi/143/orig -> origin/gh/angelayi/143/orig 2025-12-04T09:17:10.9761992Z * [new branch] gh/angelayi/144/base -> origin/gh/angelayi/144/base 2025-12-04T09:17:10.9763906Z * [new branch] gh/angelayi/144/head -> origin/gh/angelayi/144/head 2025-12-04T09:17:10.9765761Z * [new branch] gh/angelayi/144/orig -> origin/gh/angelayi/144/orig 2025-12-04T09:17:10.9769059Z * [new branch] gh/anijain2305/753/base -> origin/gh/anijain2305/753/base 2025-12-04T09:17:10.9770939Z * [new branch] gh/anijain2305/753/head -> origin/gh/anijain2305/753/head 2025-12-04T09:17:10.9772774Z * [new branch] gh/anijain2305/753/orig -> origin/gh/anijain2305/753/orig 2025-12-04T09:17:10.9775423Z * [new branch] gh/anijain2305/810/base -> origin/gh/anijain2305/810/base 2025-12-04T09:17:10.9777282Z * [new branch] gh/anijain2305/810/head -> origin/gh/anijain2305/810/head 2025-12-04T09:17:10.9779163Z * [new branch] gh/anijain2305/810/orig -> origin/gh/anijain2305/810/orig 2025-12-04T09:17:10.9781942Z * [new branch] gh/anijain2305/854/base -> origin/gh/anijain2305/854/base 2025-12-04T09:17:10.9783915Z * [new branch] gh/anijain2305/854/head -> 
origin/gh/anijain2305/854/head 2025-12-04T09:17:10.9794044Z * [new branch] gh/anijain2305/854/orig -> origin/gh/anijain2305/854/orig 2025-12-04T09:17:10.9794617Z * [new branch] gh/anijain2305/864/base -> origin/gh/anijain2305/864/base 2025-12-04T09:17:10.9795148Z * [new branch] gh/anijain2305/864/head -> origin/gh/anijain2305/864/head 2025-12-04T09:17:10.9795676Z * [new branch] gh/anijain2305/864/orig -> origin/gh/anijain2305/864/orig 2025-12-04T09:17:10.9796209Z * [new branch] gh/anijain2305/870/base -> origin/gh/anijain2305/870/base 2025-12-04T09:17:10.9796758Z * [new branch] gh/anijain2305/870/head -> origin/gh/anijain2305/870/head 2025-12-04T09:17:10.9798977Z * [new branch] gh/anijain2305/870/orig -> origin/gh/anijain2305/870/orig 2025-12-04T09:17:10.9802086Z * [new branch] gh/anijain2305/873/base -> origin/gh/anijain2305/873/base 2025-12-04T09:17:10.9803910Z * [new branch] gh/anijain2305/873/head -> origin/gh/anijain2305/873/head 2025-12-04T09:17:10.9805711Z * [new branch] gh/anijain2305/873/orig -> origin/gh/anijain2305/873/orig 2025-12-04T09:17:10.9808212Z * [new branch] gh/anijain2305/894/base -> origin/gh/anijain2305/894/base 2025-12-04T09:17:10.9810110Z * [new branch] gh/anijain2305/894/head -> origin/gh/anijain2305/894/head 2025-12-04T09:17:10.9811917Z * [new branch] gh/anijain2305/894/orig -> origin/gh/anijain2305/894/orig 2025-12-04T09:17:10.9814506Z * [new branch] gh/anijain2305/895/base -> origin/gh/anijain2305/895/base 2025-12-04T09:17:10.9816388Z * [new branch] gh/anijain2305/895/head -> origin/gh/anijain2305/895/head 2025-12-04T09:17:10.9818405Z * [new branch] gh/anijain2305/895/orig -> origin/gh/anijain2305/895/orig 2025-12-04T09:17:10.9820958Z * [new branch] gh/anijain2305/910/base -> origin/gh/anijain2305/910/base 2025-12-04T09:17:10.9822882Z * [new branch] gh/anijain2305/910/head -> origin/gh/anijain2305/910/head 2025-12-04T09:17:10.9824773Z * [new branch] gh/anijain2305/910/orig -> origin/gh/anijain2305/910/orig 2025-12-04T09:17:10.9827478Z * 
[new branch] gh/anijain2305/919/base -> origin/gh/anijain2305/919/base 2025-12-04T09:17:10.9829478Z * [new branch] gh/anijain2305/919/head -> origin/gh/anijain2305/919/head 2025-12-04T09:17:10.9831310Z * [new branch] gh/anijain2305/919/orig -> origin/gh/anijain2305/919/orig 2025-12-04T09:17:10.9833811Z * [new branch] gh/anijain2305/922/base -> origin/gh/anijain2305/922/base 2025-12-04T09:17:10.9835755Z * [new branch] gh/anijain2305/922/head -> origin/gh/anijain2305/922/head 2025-12-04T09:17:10.9837720Z * [new branch] gh/anijain2305/922/orig -> origin/gh/anijain2305/922/orig 2025-12-04T09:17:10.9840396Z * [new branch] gh/anijain2305/932/base -> origin/gh/anijain2305/932/base 2025-12-04T09:17:10.9842303Z * [new branch] gh/anijain2305/932/head -> origin/gh/anijain2305/932/head 2025-12-04T09:17:10.9844197Z * [new branch] gh/anijain2305/932/orig -> origin/gh/anijain2305/932/orig 2025-12-04T09:17:10.9846739Z * [new branch] gh/anijain2305/940/base -> origin/gh/anijain2305/940/base 2025-12-04T09:17:10.9848603Z * [new branch] gh/anijain2305/940/head -> origin/gh/anijain2305/940/head 2025-12-04T09:17:10.9850471Z * [new branch] gh/anijain2305/940/orig -> origin/gh/anijain2305/940/orig 2025-12-04T09:17:10.9853821Z * [new branch] gh/anijain2305/941/base -> origin/gh/anijain2305/941/base 2025-12-04T09:17:10.9855599Z * [new branch] gh/anijain2305/941/head -> origin/gh/anijain2305/941/head 2025-12-04T09:17:10.9857449Z * [new branch] gh/anijain2305/941/orig -> origin/gh/anijain2305/941/orig 2025-12-04T09:17:10.9860061Z * [new branch] gh/anijain2305/942/base -> origin/gh/anijain2305/942/base 2025-12-04T09:17:10.9861877Z * [new branch] gh/anijain2305/942/head -> origin/gh/anijain2305/942/head 2025-12-04T09:17:10.9863850Z * [new branch] gh/anijain2305/942/orig -> origin/gh/anijain2305/942/orig 2025-12-04T09:17:10.9866365Z * [new branch] gh/anijain2305/943/base -> origin/gh/anijain2305/943/base 2025-12-04T09:17:10.9868442Z * [new branch] gh/anijain2305/943/head -> 
origin/gh/anijain2305/943/head 2025-12-04T09:17:10.9870123Z * [new branch] gh/anijain2305/943/orig -> origin/gh/anijain2305/943/orig 2025-12-04T09:17:10.9873322Z * [new branch] gh/anijain2305/944/base -> origin/gh/anijain2305/944/base 2025-12-04T09:17:10.9875170Z * [new branch] gh/anijain2305/944/head -> origin/gh/anijain2305/944/head 2025-12-04T09:17:10.9877472Z * [new branch] gh/anijain2305/944/orig -> origin/gh/anijain2305/944/orig 2025-12-04T09:17:10.9880265Z * [new branch] gh/anijain2305/945/base -> origin/gh/anijain2305/945/base 2025-12-04T09:17:10.9882158Z * [new branch] gh/anijain2305/945/head -> origin/gh/anijain2305/945/head 2025-12-04T09:17:10.9884007Z * [new branch] gh/anijain2305/945/orig -> origin/gh/anijain2305/945/orig 2025-12-04T09:17:10.9886674Z * [new branch] gh/anijain2305/946/base -> origin/gh/anijain2305/946/base 2025-12-04T09:17:10.9888636Z * [new branch] gh/anijain2305/946/head -> origin/gh/anijain2305/946/head 2025-12-04T09:17:10.9890434Z * [new branch] gh/anijain2305/946/orig -> origin/gh/anijain2305/946/orig 2025-12-04T09:17:10.9893326Z * [new branch] gh/anijain2305/947/base -> origin/gh/anijain2305/947/base 2025-12-04T09:17:10.9894598Z * [new branch] gh/anijain2305/947/head -> origin/gh/anijain2305/947/head 2025-12-04T09:17:10.9896691Z * [new branch] gh/anijain2305/947/orig -> origin/gh/anijain2305/947/orig 2025-12-04T09:17:10.9899482Z * [new branch] gh/anijain2305/948/base -> origin/gh/anijain2305/948/base 2025-12-04T09:17:10.9901723Z * [new branch] gh/anijain2305/948/head -> origin/gh/anijain2305/948/head 2025-12-04T09:17:10.9903065Z * [new branch] gh/anijain2305/948/orig -> origin/gh/anijain2305/948/orig 2025-12-04T09:17:10.9906060Z * [new branch] gh/anijain2305/949/base -> origin/gh/anijain2305/949/base 2025-12-04T09:17:10.9907912Z * [new branch] gh/anijain2305/949/head -> origin/gh/anijain2305/949/head 2025-12-04T09:17:10.9909762Z * [new branch] gh/anijain2305/949/orig -> origin/gh/anijain2305/949/orig 2025-12-04T09:17:10.9912376Z * 
[new branch] gh/anijain2305/950/base -> origin/gh/anijain2305/950/base 2025-12-04T09:17:10.9914252Z * [new branch] gh/anijain2305/950/head -> origin/gh/anijain2305/950/head 2025-12-04T09:17:10.9916061Z * [new branch] gh/anijain2305/950/orig -> origin/gh/anijain2305/950/orig 2025-12-04T09:17:10.9918749Z * [new branch] gh/anijain2305/951/base -> origin/gh/anijain2305/951/base 2025-12-04T09:17:10.9920735Z * [new branch] gh/anijain2305/951/head -> origin/gh/anijain2305/951/head 2025-12-04T09:17:10.9922612Z * [new branch] gh/anijain2305/951/orig -> origin/gh/anijain2305/951/orig 2025-12-04T09:17:10.9925339Z * [new branch] gh/anijain2305/952/base -> origin/gh/anijain2305/952/base 2025-12-04T09:17:10.9926869Z * [new branch] gh/anijain2305/952/head -> origin/gh/anijain2305/952/head 2025-12-04T09:17:10.9928994Z * [new branch] gh/anijain2305/952/orig -> origin/gh/anijain2305/952/orig 2025-12-04T09:17:10.9931642Z * [new branch] gh/anijain2305/953/base -> origin/gh/anijain2305/953/base 2025-12-04T09:17:10.9933484Z * [new branch] gh/anijain2305/953/head -> origin/gh/anijain2305/953/head 2025-12-04T09:17:10.9936241Z * [new branch] gh/anijain2305/953/orig -> origin/gh/anijain2305/953/orig 2025-12-04T09:17:10.9938632Z * [new branch] gh/anijain2305/954/base -> origin/gh/anijain2305/954/base 2025-12-04T09:17:10.9940200Z * [new branch] gh/anijain2305/954/head -> origin/gh/anijain2305/954/head 2025-12-04T09:17:10.9941955Z * [new branch] gh/anijain2305/954/orig -> origin/gh/anijain2305/954/orig 2025-12-04T09:17:10.9945085Z * [new branch] gh/anijain2305/955/base -> origin/gh/anijain2305/955/base 2025-12-04T09:17:10.9946370Z * [new branch] gh/anijain2305/955/head -> origin/gh/anijain2305/955/head 2025-12-04T09:17:10.9948556Z * [new branch] gh/anijain2305/955/orig -> origin/gh/anijain2305/955/orig 2025-12-04T09:17:10.9951187Z * [new branch] gh/anijain2305/956/base -> origin/gh/anijain2305/956/base 2025-12-04T09:17:10.9953030Z * [new branch] gh/anijain2305/956/head -> 
origin/gh/anijain2305/956/head 2025-12-04T09:17:10.9955073Z * [new branch] gh/anijain2305/956/orig -> origin/gh/anijain2305/956/orig 2025-12-04T09:17:10.9957615Z * [new branch] gh/anijain2305/957/base -> origin/gh/anijain2305/957/base 2025-12-04T09:17:10.9959641Z * [new branch] gh/anijain2305/957/head -> origin/gh/anijain2305/957/head 2025-12-04T09:17:10.9961592Z * [new branch] gh/anijain2305/957/orig -> origin/gh/anijain2305/957/orig 2025-12-04T09:17:10.9964245Z * [new branch] gh/anijain2305/958/base -> origin/gh/anijain2305/958/base 2025-12-04T09:17:10.9966457Z * [new branch] gh/anijain2305/958/head -> origin/gh/anijain2305/958/head 2025-12-04T09:17:10.9968126Z * [new branch] gh/anijain2305/958/orig -> origin/gh/anijain2305/958/orig 2025-12-04T09:17:10.9970711Z * [new branch] gh/anijain2305/959/base -> origin/gh/anijain2305/959/base 2025-12-04T09:17:10.9972922Z * [new branch] gh/anijain2305/959/head -> origin/gh/anijain2305/959/head 2025-12-04T09:17:10.9974754Z * [new branch] gh/anijain2305/959/orig -> origin/gh/anijain2305/959/orig 2025-12-04T09:17:10.9977403Z * [new branch] gh/anijain2305/960/base -> origin/gh/anijain2305/960/base 2025-12-04T09:17:10.9979340Z * [new branch] gh/anijain2305/960/head -> origin/gh/anijain2305/960/head 2025-12-04T09:17:10.9981219Z * [new branch] gh/anijain2305/960/orig -> origin/gh/anijain2305/960/orig 2025-12-04T09:17:10.9984019Z * [new branch] gh/anijain2305/961/base -> origin/gh/anijain2305/961/base 2025-12-04T09:17:10.9985897Z * [new branch] gh/anijain2305/961/head -> origin/gh/anijain2305/961/head 2025-12-04T09:17:10.9987694Z * [new branch] gh/anijain2305/961/orig -> origin/gh/anijain2305/961/orig 2025-12-04T09:17:10.9990366Z * [new branch] gh/anijain2305/962/base -> origin/gh/anijain2305/962/base 2025-12-04T09:17:10.9992177Z * [new branch] gh/anijain2305/962/head -> origin/gh/anijain2305/962/head 2025-12-04T09:17:10.9994012Z * [new branch] gh/anijain2305/962/orig -> origin/gh/anijain2305/962/orig 2025-12-04T09:17:10.9997028Z * 
[new branch] gh/anijain2305/963/base -> origin/gh/anijain2305/963/base 2025-12-04T09:17:10.9998952Z * [new branch] gh/anijain2305/963/head -> origin/gh/anijain2305/963/head 2025-12-04T09:17:11.0001333Z * [new branch] gh/anijain2305/963/orig -> origin/gh/anijain2305/963/orig 2025-12-04T09:17:11.0005192Z * [new branch] gh/anijain2305/964/base -> origin/gh/anijain2305/964/base 2025-12-04T09:17:11.0007671Z * [new branch] gh/anijain2305/964/head -> origin/gh/anijain2305/964/head 2025-12-04T09:17:11.0009508Z * [new branch] gh/anijain2305/964/orig -> origin/gh/anijain2305/964/orig 2025-12-04T09:17:11.0012242Z * [new branch] gh/anijain2305/965/base -> origin/gh/anijain2305/965/base 2025-12-04T09:17:11.0014186Z * [new branch] gh/anijain2305/965/head -> origin/gh/anijain2305/965/head 2025-12-04T09:17:11.0016109Z * [new branch] gh/anijain2305/965/orig -> origin/gh/anijain2305/965/orig 2025-12-04T09:17:11.0018641Z * [new branch] gh/anijain2305/966/base -> origin/gh/anijain2305/966/base 2025-12-04T09:17:11.0020500Z * [new branch] gh/anijain2305/966/head -> origin/gh/anijain2305/966/head 2025-12-04T09:17:11.0022249Z * [new branch] gh/anijain2305/966/orig -> origin/gh/anijain2305/966/orig 2025-12-04T09:17:11.0024992Z * [new branch] gh/anijain2305/967/base -> origin/gh/anijain2305/967/base 2025-12-04T09:17:11.0026880Z * [new branch] gh/anijain2305/967/head -> origin/gh/anijain2305/967/head 2025-12-04T09:17:11.0028864Z * [new branch] gh/anijain2305/967/orig -> origin/gh/anijain2305/967/orig 2025-12-04T09:17:11.0031422Z * [new branch] gh/anijain2305/968/base -> origin/gh/anijain2305/968/base 2025-12-04T09:17:11.0033275Z * [new branch] gh/anijain2305/968/head -> origin/gh/anijain2305/968/head 2025-12-04T09:17:11.0035119Z * [new branch] gh/anijain2305/968/orig -> origin/gh/anijain2305/968/orig 2025-12-04T09:17:11.0037757Z * [new branch] gh/anijain2305/969/base -> origin/gh/anijain2305/969/base 2025-12-04T09:17:11.0039743Z * [new branch] gh/anijain2305/969/head -> 
origin/gh/anijain2305/969/head 2025-12-04T09:17:11.0041878Z * [new branch] gh/anijain2305/969/orig -> origin/gh/anijain2305/969/orig 2025-12-04T09:17:11.0044423Z * [new branch] gh/anijain2305/970/base -> origin/gh/anijain2305/970/base 2025-12-04T09:17:11.0046321Z * [new branch] gh/anijain2305/970/head -> origin/gh/anijain2305/970/head 2025-12-04T09:17:11.0048053Z * [new branch] gh/anijain2305/970/orig -> origin/gh/anijain2305/970/orig 2025-12-04T09:17:11.0051401Z * [new branch] gh/anjali411/216/base -> origin/gh/anjali411/216/base 2025-12-04T09:17:11.0053284Z * [new branch] gh/anjali411/216/head -> origin/gh/anjali411/216/head 2025-12-04T09:17:11.0055129Z * [new branch] gh/anjali411/216/orig -> origin/gh/anjali411/216/orig 2025-12-04T09:17:11.0058327Z * [new branch] gh/anshul-si/1/base -> origin/gh/anshul-si/1/base 2025-12-04T09:17:11.0060208Z * [new branch] gh/anshul-si/1/head -> origin/gh/anshul-si/1/head 2025-12-04T09:17:11.0062744Z * [new branch] gh/anshul-si/2/base -> origin/gh/anshul-si/2/base 2025-12-04T09:17:11.0064556Z * [new branch] gh/anshul-si/2/head -> origin/gh/anshul-si/2/head 2025-12-04T09:17:11.0067000Z * [new branch] gh/anshul-si/3/base -> origin/gh/anshul-si/3/base 2025-12-04T09:17:11.0068823Z * [new branch] gh/anshul-si/3/head -> origin/gh/anshul-si/3/head 2025-12-04T09:17:11.0071166Z * [new branch] gh/anshul-si/4/base -> origin/gh/anshul-si/4/base 2025-12-04T09:17:11.0072966Z * [new branch] gh/anshul-si/4/head -> origin/gh/anshul-si/4/head 2025-12-04T09:17:11.0075300Z * [new branch] gh/anshul-si/5/base -> origin/gh/anshul-si/5/base 2025-12-04T09:17:11.0077145Z * [new branch] gh/anshul-si/5/head -> origin/gh/anshul-si/5/head 2025-12-04T09:17:11.0079955Z * [new branch] gh/anshul-si/53/base -> origin/gh/anshul-si/53/base 2025-12-04T09:17:11.0081867Z * [new branch] gh/anshul-si/53/head -> origin/gh/anshul-si/53/head 2025-12-04T09:17:11.0084519Z * [new branch] gh/anshul-si/58/base -> origin/gh/anshul-si/58/base 2025-12-04T09:17:11.0086373Z * [new 
branch] gh/anshul-si/58/head -> origin/gh/anshul-si/58/head 2025-12-04T09:17:11.0088720Z * [new branch] gh/anshul-si/66/base -> origin/gh/anshul-si/66/base 2025-12-04T09:17:11.0090664Z * [new branch] gh/anshul-si/66/head -> origin/gh/anshul-si/66/head 2025-12-04T09:17:11.0092576Z * [new branch] gh/anshul-si/66/orig -> origin/gh/anshul-si/66/orig 2025-12-04T09:17:11.0094952Z * [new branch] gh/anshul-si/67/base -> origin/gh/anshul-si/67/base 2025-12-04T09:17:11.0096809Z * [new branch] gh/anshul-si/67/head -> origin/gh/anshul-si/67/head 2025-12-04T09:17:11.0098639Z * [new branch] gh/anshul-si/67/orig -> origin/gh/anshul-si/67/orig 2025-12-04T09:17:11.0101629Z * [new branch] gh/anshul-si/68/base -> origin/gh/anshul-si/68/base 2025-12-04T09:17:11.0103391Z * [new branch] gh/anshul-si/68/head -> origin/gh/anshul-si/68/head 2025-12-04T09:17:11.0105138Z * [new branch] gh/anshul-si/68/orig -> origin/gh/anshul-si/68/orig 2025-12-04T09:17:11.0107989Z * [new branch] gh/anshul-si/69/base -> origin/gh/anshul-si/69/base 2025-12-04T09:17:11.0109844Z * [new branch] gh/anshul-si/69/head -> origin/gh/anshul-si/69/head 2025-12-04T09:17:11.0111732Z * [new branch] gh/anshul-si/69/orig -> origin/gh/anshul-si/69/orig 2025-12-04T09:17:11.0114257Z * [new branch] gh/anshul-si/70/base -> origin/gh/anshul-si/70/base 2025-12-04T09:17:11.0116128Z * [new branch] gh/anshul-si/70/head -> origin/gh/anshul-si/70/head 2025-12-04T09:17:11.0118125Z * [new branch] gh/anshul-si/70/orig -> origin/gh/anshul-si/70/orig 2025-12-04T09:17:11.0120786Z * [new branch] gh/anshul-si/71/base -> origin/gh/anshul-si/71/base 2025-12-04T09:17:11.0122656Z * [new branch] gh/anshul-si/71/head -> origin/gh/anshul-si/71/head 2025-12-04T09:17:11.0124493Z * [new branch] gh/anshul-si/71/orig -> origin/gh/anshul-si/71/orig 2025-12-04T09:17:11.0127088Z * [new branch] gh/anshul-si/72/base -> origin/gh/anshul-si/72/base 2025-12-04T09:17:11.0129057Z * [new branch] gh/anshul-si/72/head -> origin/gh/anshul-si/72/head 
2025-12-04T09:17:11.0130843Z * [new branch] gh/anshul-si/72/orig -> origin/gh/anshul-si/72/orig 2025-12-04T09:17:11.0133346Z * [new branch] gh/anshul-si/73/base -> origin/gh/anshul-si/73/base 2025-12-04T09:17:11.0135366Z * [new branch] gh/anshul-si/73/head -> origin/gh/anshul-si/73/head 2025-12-04T09:17:11.0137443Z * [new branch] gh/anshul-si/73/orig -> origin/gh/anshul-si/73/orig 2025-12-04T09:17:11.0140800Z * [new branch] gh/aorenste/132/base -> origin/gh/aorenste/132/base 2025-12-04T09:17:11.0142683Z * [new branch] gh/aorenste/132/head -> origin/gh/aorenste/132/head 2025-12-04T09:17:11.0145426Z * [new branch] gh/aorenste/134/base -> origin/gh/aorenste/134/base 2025-12-04T09:17:11.0147390Z * [new branch] gh/aorenste/134/head -> origin/gh/aorenste/134/head 2025-12-04T09:17:11.0149273Z * [new branch] gh/aorenste/134/orig -> origin/gh/aorenste/134/orig 2025-12-04T09:17:11.0151842Z * [new branch] gh/aorenste/139/base -> origin/gh/aorenste/139/base 2025-12-04T09:17:11.0153677Z * [new branch] gh/aorenste/139/head -> origin/gh/aorenste/139/head 2025-12-04T09:17:11.0155559Z * [new branch] gh/aorenste/139/orig -> origin/gh/aorenste/139/orig 2025-12-04T09:17:11.0158135Z * [new branch] gh/aorenste/141/base -> origin/gh/aorenste/141/base 2025-12-04T09:17:11.0160151Z * [new branch] gh/aorenste/141/head -> origin/gh/aorenste/141/head 2025-12-04T09:17:11.0163045Z * [new branch] gh/aorenste/145/base -> origin/gh/aorenste/145/base 2025-12-04T09:17:11.0164912Z * [new branch] gh/aorenste/145/head -> origin/gh/aorenste/145/head 2025-12-04T09:17:11.0166900Z * [new branch] gh/aorenste/145/orig -> origin/gh/aorenste/145/orig 2025-12-04T09:17:11.0169593Z * [new branch] gh/aorenste/146/base -> origin/gh/aorenste/146/base 2025-12-04T09:17:11.0171555Z * [new branch] gh/aorenste/146/head -> origin/gh/aorenste/146/head 2025-12-04T09:17:11.0173553Z * [new branch] gh/aorenste/146/orig -> origin/gh/aorenste/146/orig 2025-12-04T09:17:11.0176103Z * [new branch] gh/aorenste/147/base -> 
origin/gh/aorenste/147/base 2025-12-04T09:17:11.0178030Z * [new branch] gh/aorenste/147/head -> origin/gh/aorenste/147/head 2025-12-04T09:17:11.0179888Z * [new branch] gh/aorenste/147/orig -> origin/gh/aorenste/147/orig 2025-12-04T09:17:11.0182431Z * [new branch] gh/aorenste/148/base -> origin/gh/aorenste/148/base 2025-12-04T09:17:11.0184300Z * [new branch] gh/aorenste/148/head -> origin/gh/aorenste/148/head 2025-12-04T09:17:11.0186310Z * [new branch] gh/aorenste/148/orig -> origin/gh/aorenste/148/orig 2025-12-04T09:17:11.0189103Z * [new branch] gh/aorenste/149/base -> origin/gh/aorenste/149/base 2025-12-04T09:17:11.0190960Z * [new branch] gh/aorenste/149/head -> origin/gh/aorenste/149/head 2025-12-04T09:17:11.0192839Z * [new branch] gh/aorenste/149/orig -> origin/gh/aorenste/149/orig 2025-12-04T09:17:11.0195471Z * [new branch] gh/aorenste/150/base -> origin/gh/aorenste/150/base 2025-12-04T09:17:11.0197256Z * [new branch] gh/aorenste/150/head -> origin/gh/aorenste/150/head 2025-12-04T09:17:11.0199155Z * [new branch] gh/aorenste/150/orig -> origin/gh/aorenste/150/orig 2025-12-04T09:17:11.0202925Z * [new branch] gh/aorenste/151/base -> origin/gh/aorenste/151/base 2025-12-04T09:17:11.0205181Z * [new branch] gh/aorenste/151/head -> origin/gh/aorenste/151/head 2025-12-04T09:17:11.0207201Z * [new branch] gh/aorenste/151/orig -> origin/gh/aorenste/151/orig 2025-12-04T09:17:11.0209803Z * [new branch] gh/aorenste/152/base -> origin/gh/aorenste/152/base 2025-12-04T09:17:11.0211618Z * [new branch] gh/aorenste/152/head -> origin/gh/aorenste/152/head 2025-12-04T09:17:11.0213441Z * [new branch] gh/aorenste/152/orig -> origin/gh/aorenste/152/orig 2025-12-04T09:17:11.0215822Z * [new branch] gh/aorenste/153/base -> origin/gh/aorenste/153/base 2025-12-04T09:17:11.0217615Z * [new branch] gh/aorenste/153/head -> origin/gh/aorenste/153/head 2025-12-04T09:17:11.0219505Z * [new branch] gh/aorenste/153/orig -> origin/gh/aorenste/153/orig 2025-12-04T09:17:11.0222447Z * [new branch] 
gh/aorenste/154/base -> origin/gh/aorenste/154/base 2025-12-04T09:17:11.0223672Z * [new branch] gh/aorenste/154/head -> origin/gh/aorenste/154/head 2025-12-04T09:17:11.0225782Z * [new branch] gh/aorenste/154/orig -> origin/gh/aorenste/154/orig 2025-12-04T09:17:11.0228116Z * [new branch] gh/aorenste/155/base -> origin/gh/aorenste/155/base 2025-12-04T09:17:11.0230187Z * [new branch] gh/aorenste/155/head -> origin/gh/aorenste/155/head 2025-12-04T09:17:11.0231559Z * [new branch] gh/aorenste/155/orig -> origin/gh/aorenste/155/orig 2025-12-04T09:17:11.0234191Z * [new branch] gh/aorenste/156/base -> origin/gh/aorenste/156/base 2025-12-04T09:17:11.0235997Z * [new branch] gh/aorenste/156/head -> origin/gh/aorenste/156/head 2025-12-04T09:17:11.0237802Z * [new branch] gh/aorenste/156/orig -> origin/gh/aorenste/156/orig 2025-12-04T09:17:11.0241012Z * [new branch] gh/aorenste/157/base -> origin/gh/aorenste/157/base 2025-12-04T09:17:11.0242845Z * [new branch] gh/aorenste/157/head -> origin/gh/aorenste/157/head 2025-12-04T09:17:11.0244816Z * [new branch] gh/aorenste/157/orig -> origin/gh/aorenste/157/orig 2025-12-04T09:17:11.0247260Z * [new branch] gh/aorenste/158/base -> origin/gh/aorenste/158/base 2025-12-04T09:17:11.0249143Z * [new branch] gh/aorenste/158/head -> origin/gh/aorenste/158/head 2025-12-04T09:17:11.0250918Z * [new branch] gh/aorenste/158/orig -> origin/gh/aorenste/158/orig 2025-12-04T09:17:11.0253357Z * [new branch] gh/aorenste/159/base -> origin/gh/aorenste/159/base 2025-12-04T09:17:11.0255264Z * [new branch] gh/aorenste/159/head -> origin/gh/aorenste/159/head 2025-12-04T09:17:11.0256962Z * [new branch] gh/aorenste/159/orig -> origin/gh/aorenste/159/orig 2025-12-04T09:17:11.0260144Z * [new branch] gh/avikchaudhuri/1/base -> origin/gh/avikchaudhuri/1/base 2025-12-04T09:17:11.0262059Z * [new branch] gh/avikchaudhuri/1/head -> origin/gh/avikchaudhuri/1/head 2025-12-04T09:17:11.0264361Z * [new branch] gh/avikchaudhuri/2/base -> origin/gh/avikchaudhuri/2/base 
2025-12-04T09:17:11.0266158Z * [new branch] gh/avikchaudhuri/2/head -> origin/gh/avikchaudhuri/2/head 2025-12-04T09:17:11.0267958Z * [new branch] gh/avikchaudhuri/2/orig -> origin/gh/avikchaudhuri/2/orig 2025-12-04T09:17:11.0271578Z * [new branch] gh/bdhirsh/666/base -> origin/gh/bdhirsh/666/base 2025-12-04T09:17:11.0273369Z * [new branch] gh/bdhirsh/666/head -> origin/gh/bdhirsh/666/head 2025-12-04T09:17:11.0275231Z * [new branch] gh/bdhirsh/666/orig -> origin/gh/bdhirsh/666/orig 2025-12-04T09:17:11.0277761Z * [new branch] gh/bdhirsh/668/base -> origin/gh/bdhirsh/668/base 2025-12-04T09:17:11.0279722Z * [new branch] gh/bdhirsh/668/head -> origin/gh/bdhirsh/668/head 2025-12-04T09:17:11.0281621Z * [new branch] gh/bdhirsh/668/orig -> origin/gh/bdhirsh/668/orig 2025-12-04T09:17:11.0284322Z * [new branch] gh/bdhirsh/669/base -> origin/gh/bdhirsh/669/base 2025-12-04T09:17:11.0286166Z * [new branch] gh/bdhirsh/669/head -> origin/gh/bdhirsh/669/head 2025-12-04T09:17:11.0287925Z * [new branch] gh/bdhirsh/669/orig -> origin/gh/bdhirsh/669/orig 2025-12-04T09:17:11.0290696Z * [new branch] gh/bdhirsh/670/base -> origin/gh/bdhirsh/670/base 2025-12-04T09:17:11.0292641Z * [new branch] gh/bdhirsh/670/head -> origin/gh/bdhirsh/670/head 2025-12-04T09:17:11.0294586Z * [new branch] gh/bdhirsh/670/orig -> origin/gh/bdhirsh/670/orig 2025-12-04T09:17:11.0297078Z * [new branch] gh/bdhirsh/672/base -> origin/gh/bdhirsh/672/base 2025-12-04T09:17:11.0298935Z * [new branch] gh/bdhirsh/672/head -> origin/gh/bdhirsh/672/head 2025-12-04T09:17:11.0300633Z * [new branch] gh/bdhirsh/672/orig -> origin/gh/bdhirsh/672/orig 2025-12-04T09:17:11.0303931Z * [new branch] gh/bdhirsh/675/base -> origin/gh/bdhirsh/675/base 2025-12-04T09:17:11.0305846Z * [new branch] gh/bdhirsh/675/head -> origin/gh/bdhirsh/675/head 2025-12-04T09:17:11.0307696Z * [new branch] gh/bdhirsh/675/orig -> origin/gh/bdhirsh/675/orig 2025-12-04T09:17:11.0310115Z * [new branch] gh/bdhirsh/676/base -> origin/gh/bdhirsh/676/base 
2025-12-04T09:17:11.0312117Z * [new branch] gh/bdhirsh/676/head -> origin/gh/bdhirsh/676/head 2025-12-04T09:17:11.0313942Z * [new branch] gh/bdhirsh/676/orig -> origin/gh/bdhirsh/676/orig 2025-12-04T09:17:11.0316496Z * [new branch] gh/bdhirsh/677/base -> origin/gh/bdhirsh/677/base 2025-12-04T09:17:11.0318782Z * [new branch] gh/bdhirsh/677/head -> origin/gh/bdhirsh/677/head 2025-12-04T09:17:11.0320830Z * [new branch] gh/bdhirsh/677/orig -> origin/gh/bdhirsh/677/orig 2025-12-04T09:17:11.0323698Z * [new branch] gh/bdhirsh/678/base -> origin/gh/bdhirsh/678/base 2025-12-04T09:17:11.0325475Z * [new branch] gh/bdhirsh/678/head -> origin/gh/bdhirsh/678/head 2025-12-04T09:17:11.0327257Z * [new branch] gh/bdhirsh/678/orig -> origin/gh/bdhirsh/678/orig 2025-12-04T09:17:11.0330011Z * [new branch] gh/bdhirsh/679/base -> origin/gh/bdhirsh/679/base 2025-12-04T09:17:11.0331970Z * [new branch] gh/bdhirsh/679/head -> origin/gh/bdhirsh/679/head 2025-12-04T09:17:11.0333973Z * [new branch] gh/bdhirsh/679/orig -> origin/gh/bdhirsh/679/orig 2025-12-04T09:17:11.0336263Z * [new branch] gh/bdhirsh/680/base -> origin/gh/bdhirsh/680/base 2025-12-04T09:17:11.0338241Z * [new branch] gh/bdhirsh/680/head -> origin/gh/bdhirsh/680/head 2025-12-04T09:17:11.0340186Z * [new branch] gh/bdhirsh/680/orig -> origin/gh/bdhirsh/680/orig 2025-12-04T09:17:11.0342570Z * [new branch] gh/bdhirsh/681/base -> origin/gh/bdhirsh/681/base 2025-12-04T09:17:11.0344478Z * [new branch] gh/bdhirsh/681/head -> origin/gh/bdhirsh/681/head 2025-12-04T09:17:11.0346915Z * [new branch] gh/bdhirsh/681/orig -> origin/gh/bdhirsh/681/orig 2025-12-04T09:17:11.0349557Z * [new branch] gh/benjaminglass1/101/base -> origin/gh/benjaminglass1/101/base 2025-12-04T09:17:11.0351518Z * [new branch] gh/benjaminglass1/101/head -> origin/gh/benjaminglass1/101/head 2025-12-04T09:17:11.0353082Z * [new branch] gh/benjaminglass1/101/orig -> origin/gh/benjaminglass1/101/orig 2025-12-04T09:17:11.0355802Z * [new branch] gh/benjaminglass1/102/base -> 
origin/gh/benjaminglass1/102/base 2025-12-04T09:17:11.0357715Z * [new branch] gh/benjaminglass1/102/head -> origin/gh/benjaminglass1/102/head 2025-12-04T09:17:11.0359651Z * [new branch] gh/benjaminglass1/102/orig -> origin/gh/benjaminglass1/102/orig 2025-12-04T09:17:11.0362329Z * [new branch] gh/benjaminglass1/106/base -> origin/gh/benjaminglass1/106/base 2025-12-04T09:17:11.0364212Z * [new branch] gh/benjaminglass1/106/head -> origin/gh/benjaminglass1/106/head 2025-12-04T09:17:11.0366021Z * [new branch] gh/benjaminglass1/106/orig -> origin/gh/benjaminglass1/106/orig 2025-12-04T09:17:11.0368571Z * [new branch] gh/benjaminglass1/107/base -> origin/gh/benjaminglass1/107/base 2025-12-04T09:17:11.0370527Z * [new branch] gh/benjaminglass1/107/head -> origin/gh/benjaminglass1/107/head 2025-12-04T09:17:11.0372398Z * [new branch] gh/benjaminglass1/107/orig -> origin/gh/benjaminglass1/107/orig 2025-12-04T09:17:11.0374954Z * [new branch] gh/benjaminglass1/108/base -> origin/gh/benjaminglass1/108/base 2025-12-04T09:17:11.0376728Z * [new branch] gh/benjaminglass1/108/head -> origin/gh/benjaminglass1/108/head 2025-12-04T09:17:11.0378595Z * [new branch] gh/benjaminglass1/108/orig -> origin/gh/benjaminglass1/108/orig 2025-12-04T09:17:11.0381103Z * [new branch] gh/benjaminglass1/109/base -> origin/gh/benjaminglass1/109/base 2025-12-04T09:17:11.0382972Z * [new branch] gh/benjaminglass1/109/head -> origin/gh/benjaminglass1/109/head 2025-12-04T09:17:11.0384848Z * [new branch] gh/benjaminglass1/109/orig -> origin/gh/benjaminglass1/109/orig 2025-12-04T09:17:11.0387351Z * [new branch] gh/benjaminglass1/97/base -> origin/gh/benjaminglass1/97/base 2025-12-04T09:17:11.0389185Z * [new branch] gh/benjaminglass1/97/head -> origin/gh/benjaminglass1/97/head 2025-12-04T09:17:11.0391258Z * [new branch] gh/benjaminglass1/97/orig -> origin/gh/benjaminglass1/97/orig 2025-12-04T09:17:11.0394373Z * [new branch] gh/bobrenjc93/570/base -> origin/gh/bobrenjc93/570/base 2025-12-04T09:17:11.0396237Z * [new 
branch] gh/bobrenjc93/570/head -> origin/gh/bobrenjc93/570/head 2025-12-04T09:17:11.0398075Z * [new branch] gh/bobrenjc93/570/orig -> origin/gh/bobrenjc93/570/orig 2025-12-04T09:17:11.0400642Z * [new branch] gh/bobrenjc93/604/base -> origin/gh/bobrenjc93/604/base 2025-12-04T09:17:11.0402827Z * [new branch] gh/bobrenjc93/604/head -> origin/gh/bobrenjc93/604/head 2025-12-04T09:17:11.0404685Z * [new branch] gh/bobrenjc93/604/orig -> origin/gh/bobrenjc93/604/orig 2025-12-04T09:17:11.0407276Z * [new branch] gh/bobrenjc93/638/base -> origin/gh/bobrenjc93/638/base 2025-12-04T09:17:11.0409172Z * [new branch] gh/bobrenjc93/638/head -> origin/gh/bobrenjc93/638/head 2025-12-04T09:17:11.0411158Z * [new branch] gh/bobrenjc93/638/orig -> origin/gh/bobrenjc93/638/orig 2025-12-04T09:17:11.0413512Z * [new branch] gh/bobrenjc93/653/base -> origin/gh/bobrenjc93/653/base 2025-12-04T09:17:11.0415372Z * [new branch] gh/bobrenjc93/653/head -> origin/gh/bobrenjc93/653/head 2025-12-04T09:17:11.0417171Z * [new branch] gh/bobrenjc93/653/orig -> origin/gh/bobrenjc93/653/orig 2025-12-04T09:17:11.0420051Z * [new branch] gh/bobrenjc93/654/base -> origin/gh/bobrenjc93/654/base 2025-12-04T09:17:11.0421784Z * [new branch] gh/bobrenjc93/654/head -> origin/gh/bobrenjc93/654/head 2025-12-04T09:17:11.0423517Z * [new branch] gh/bobrenjc93/654/orig -> origin/gh/bobrenjc93/654/orig 2025-12-04T09:17:11.0425951Z * [new branch] gh/bobrenjc93/657/base -> origin/gh/bobrenjc93/657/base 2025-12-04T09:17:11.0427876Z * [new branch] gh/bobrenjc93/657/head -> origin/gh/bobrenjc93/657/head 2025-12-04T09:17:11.0429596Z * [new branch] gh/bobrenjc93/657/orig -> origin/gh/bobrenjc93/657/orig 2025-12-04T09:17:11.0432200Z * [new branch] gh/bobrenjc93/672/base -> origin/gh/bobrenjc93/672/base 2025-12-04T09:17:11.0433914Z * [new branch] gh/bobrenjc93/672/head -> origin/gh/bobrenjc93/672/head 2025-12-04T09:17:11.0435785Z * [new branch] gh/bobrenjc93/672/orig -> origin/gh/bobrenjc93/672/orig 2025-12-04T09:17:11.0438267Z * [new 
branch] gh/bobrenjc93/679/base -> origin/gh/bobrenjc93/679/base 2025-12-04T09:17:11.0440458Z * [new branch] gh/bobrenjc93/679/head -> origin/gh/bobrenjc93/679/head 2025-12-04T09:17:11.0442354Z * [new branch] gh/bobrenjc93/679/orig -> origin/gh/bobrenjc93/679/orig 2025-12-04T09:17:11.0445050Z * [new branch] gh/bobrenjc93/680/base -> origin/gh/bobrenjc93/680/base 2025-12-04T09:17:11.0446875Z * [new branch] gh/bobrenjc93/680/head -> origin/gh/bobrenjc93/680/head 2025-12-04T09:17:11.0448694Z * [new branch] gh/bobrenjc93/680/orig -> origin/gh/bobrenjc93/680/orig 2025-12-04T09:17:11.0451023Z * [new branch] gh/bobrenjc93/681/base -> origin/gh/bobrenjc93/681/base 2025-12-04T09:17:11.0452946Z * [new branch] gh/bobrenjc93/681/head -> origin/gh/bobrenjc93/681/head 2025-12-04T09:17:11.0454829Z * [new branch] gh/bobrenjc93/681/orig -> origin/gh/bobrenjc93/681/orig 2025-12-04T09:17:11.0457156Z * [new branch] gh/bobrenjc93/682/base -> origin/gh/bobrenjc93/682/base 2025-12-04T09:17:11.0458964Z * [new branch] gh/bobrenjc93/682/head -> origin/gh/bobrenjc93/682/head 2025-12-04T09:17:11.0460817Z * [new branch] gh/bobrenjc93/682/orig -> origin/gh/bobrenjc93/682/orig 2025-12-04T09:17:11.0463726Z * [new branch] gh/bobrenjc93/683/base -> origin/gh/bobrenjc93/683/base 2025-12-04T09:17:11.0464943Z * [new branch] gh/bobrenjc93/683/head -> origin/gh/bobrenjc93/683/head 2025-12-04T09:17:11.0466959Z * [new branch] gh/bobrenjc93/683/orig -> origin/gh/bobrenjc93/683/orig 2025-12-04T09:17:11.0469669Z * [new branch] gh/bobrenjc93/684/base -> origin/gh/bobrenjc93/684/base 2025-12-04T09:17:11.0471640Z * [new branch] gh/bobrenjc93/684/head -> origin/gh/bobrenjc93/684/head 2025-12-04T09:17:11.0473696Z * [new branch] gh/bobrenjc93/684/orig -> origin/gh/bobrenjc93/684/orig 2025-12-04T09:17:11.0476183Z * [new branch] gh/bobrenjc93/685/base -> origin/gh/bobrenjc93/685/base 2025-12-04T09:17:11.0478288Z * [new branch] gh/bobrenjc93/685/head -> origin/gh/bobrenjc93/685/head 2025-12-04T09:17:11.0480631Z * [new 
branch] gh/bobrenjc93/685/orig -> origin/gh/bobrenjc93/685/orig 2025-12-04T09:17:11.0483285Z * [new branch] gh/bobrenjc93/686/base -> origin/gh/bobrenjc93/686/base 2025-12-04T09:17:11.0486864Z * [new branch] gh/bobrenjc93/686/head -> origin/gh/bobrenjc93/686/head 2025-12-04T09:17:11.0487608Z * [new branch] gh/bobrenjc93/686/orig -> origin/gh/bobrenjc93/686/orig 2025-12-04T09:17:11.0488911Z * [new branch] gh/bobrenjc93/687/base -> origin/gh/bobrenjc93/687/base 2025-12-04T09:17:11.0491468Z * [new branch] gh/bobrenjc93/687/head -> origin/gh/bobrenjc93/687/head 2025-12-04T09:17:11.0493194Z * [new branch] gh/bobrenjc93/687/orig -> origin/gh/bobrenjc93/687/orig 2025-12-04T09:17:11.0496453Z * [new branch] gh/bobrenjc93/688/base -> origin/gh/bobrenjc93/688/base 2025-12-04T09:17:11.0498181Z * [new branch] gh/bobrenjc93/688/head -> origin/gh/bobrenjc93/688/head 2025-12-04T09:17:11.0500052Z * [new branch] gh/bobrenjc93/688/orig -> origin/gh/bobrenjc93/688/orig 2025-12-04T09:17:11.0502897Z * [new branch] gh/bobrenjc93/689/base -> origin/gh/bobrenjc93/689/base 2025-12-04T09:17:11.0504867Z * [new branch] gh/bobrenjc93/689/head -> origin/gh/bobrenjc93/689/head 2025-12-04T09:17:11.0506655Z * [new branch] gh/bobrenjc93/689/orig -> origin/gh/bobrenjc93/689/orig 2025-12-04T09:17:11.0509162Z * [new branch] gh/bobrenjc93/690/base -> origin/gh/bobrenjc93/690/base 2025-12-04T09:17:11.0511031Z * [new branch] gh/bobrenjc93/690/head -> origin/gh/bobrenjc93/690/head 2025-12-04T09:17:11.0512841Z * [new branch] gh/bobrenjc93/690/orig -> origin/gh/bobrenjc93/690/orig 2025-12-04T09:17:11.0516069Z * [new branch] gh/bobrenjc93/691/base -> origin/gh/bobrenjc93/691/base 2025-12-04T09:17:11.0518121Z * [new branch] gh/bobrenjc93/691/head -> origin/gh/bobrenjc93/691/head 2025-12-04T09:17:11.0520789Z * [new branch] gh/bobrenjc93/691/orig -> origin/gh/bobrenjc93/691/orig 2025-12-04T09:17:11.0523910Z * [new branch] gh/bobrenjc93/692/base -> origin/gh/bobrenjc93/692/base 2025-12-04T09:17:11.0525767Z * [new 
branch] gh/bobrenjc93/692/head -> origin/gh/bobrenjc93/692/head 2025-12-04T09:17:11.0527690Z * [new branch] gh/bobrenjc93/692/orig -> origin/gh/bobrenjc93/692/orig 2025-12-04T09:17:11.0530145Z * [new branch] gh/bobrenjc93/693/base -> origin/gh/bobrenjc93/693/base 2025-12-04T09:17:11.0531935Z * [new branch] gh/bobrenjc93/693/head -> origin/gh/bobrenjc93/693/head 2025-12-04T09:17:11.0533790Z * [new branch] gh/bobrenjc93/693/orig -> origin/gh/bobrenjc93/693/orig 2025-12-04T09:17:11.0536388Z * [new branch] gh/bobrenjc93/694/base -> origin/gh/bobrenjc93/694/base 2025-12-04T09:17:11.0538287Z * [new branch] gh/bobrenjc93/694/head -> origin/gh/bobrenjc93/694/head 2025-12-04T09:17:11.0540123Z * [new branch] gh/bobrenjc93/694/orig -> origin/gh/bobrenjc93/694/orig 2025-12-04T09:17:11.0542745Z * [new branch] gh/bobrenjc93/695/base -> origin/gh/bobrenjc93/695/base 2025-12-04T09:17:11.0544569Z * [new branch] gh/bobrenjc93/695/head -> origin/gh/bobrenjc93/695/head 2025-12-04T09:17:11.0546396Z * [new branch] gh/bobrenjc93/695/orig -> origin/gh/bobrenjc93/695/orig 2025-12-04T09:17:11.0549648Z * [new branch] gh/c00w/23/base -> origin/gh/c00w/23/base 2025-12-04T09:17:11.0551487Z * [new branch] gh/c00w/23/head -> origin/gh/c00w/23/head 2025-12-04T09:17:11.0554195Z * [new branch] gh/c00w/53/base -> origin/gh/c00w/53/base 2025-12-04T09:17:11.0555872Z * [new branch] gh/c00w/53/head -> origin/gh/c00w/53/head 2025-12-04T09:17:11.0557702Z * [new branch] gh/c00w/53/orig -> origin/gh/c00w/53/orig 2025-12-04T09:17:11.0560265Z * [new branch] gh/c00w/54/base -> origin/gh/c00w/54/base 2025-12-04T09:17:11.0562161Z * [new branch] gh/c00w/54/head -> origin/gh/c00w/54/head 2025-12-04T09:17:11.0564107Z * [new branch] gh/c00w/54/orig -> origin/gh/c00w/54/orig 2025-12-04T09:17:11.0566471Z * [new branch] gh/c00w/56/base -> origin/gh/c00w/56/base 2025-12-04T09:17:11.0592400Z * [new branch] gh/c00w/56/head -> origin/gh/c00w/56/head 2025-12-04T09:17:11.0592945Z * [new branch] gh/c00w/56/orig -> 
origin/gh/c00w/56/orig 2025-12-04T09:17:11.0593428Z * [new branch] gh/c00w/57/base -> origin/gh/c00w/57/base 2025-12-04T09:17:11.0593996Z * [new branch] gh/c00w/57/head -> origin/gh/c00w/57/head 2025-12-04T09:17:11.0594457Z * [new branch] gh/c00w/57/orig -> origin/gh/c00w/57/orig 2025-12-04T09:17:11.0594922Z * [new branch] gh/c00w/58/base -> origin/gh/c00w/58/base 2025-12-04T09:17:11.0595378Z * [new branch] gh/c00w/58/head -> origin/gh/c00w/58/head 2025-12-04T09:17:11.0595840Z * [new branch] gh/c00w/58/orig -> origin/gh/c00w/58/orig 2025-12-04T09:17:11.0596320Z * [new branch] gh/clee2000/1/base -> origin/gh/clee2000/1/base 2025-12-04T09:17:11.0596814Z * [new branch] gh/clee2000/1/head -> origin/gh/clee2000/1/head 2025-12-04T09:17:11.0597304Z * [new branch] gh/clee2000/1/orig -> origin/gh/clee2000/1/orig 2025-12-04T09:17:11.0597828Z * [new branch] gh/coconutruben/1/base -> origin/gh/coconutruben/1/base 2025-12-04T09:17:11.0598434Z * [new branch] gh/coconutruben/1/head -> origin/gh/coconutruben/1/head 2025-12-04T09:17:11.0598974Z * [new branch] gh/coconutruben/55/base -> origin/gh/coconutruben/55/base 2025-12-04T09:17:11.0599610Z * [new branch] gh/coconutruben/55/head -> origin/gh/coconutruben/55/head 2025-12-04T09:17:11.0601954Z * [new branch] gh/coconutruben/55/orig -> origin/gh/coconutruben/55/orig 2025-12-04T09:17:11.0604913Z * [new branch] gh/coconutruben/57/base -> origin/gh/coconutruben/57/base 2025-12-04T09:17:11.0606914Z * [new branch] gh/coconutruben/57/head -> origin/gh/coconutruben/57/head 2025-12-04T09:17:11.0608788Z * [new branch] gh/coconutruben/57/orig -> origin/gh/coconutruben/57/orig 2025-12-04T09:17:11.0611353Z * [new branch] gh/coconutruben/70/base -> origin/gh/coconutruben/70/base 2025-12-04T09:17:11.0613350Z * [new branch] gh/coconutruben/70/head -> origin/gh/coconutruben/70/head 2025-12-04T09:17:11.0615220Z * [new branch] gh/coconutruben/70/orig -> origin/gh/coconutruben/70/orig 2025-12-04T09:17:11.0617592Z * [new branch] 
gh/coconutruben/71/base -> origin/gh/coconutruben/71/base 2025-12-04T09:17:11.0619440Z * [new branch] gh/coconutruben/71/head -> origin/gh/coconutruben/71/head 2025-12-04T09:17:11.0621355Z * [new branch] gh/coconutruben/71/orig -> origin/gh/coconutruben/71/orig 2025-12-04T09:17:11.0624089Z * [new branch] gh/coconutruben/72/base -> origin/gh/coconutruben/72/base 2025-12-04T09:17:11.0625720Z * [new branch] gh/coconutruben/72/head -> origin/gh/coconutruben/72/head 2025-12-04T09:17:11.0627648Z * [new branch] gh/coconutruben/72/orig -> origin/gh/coconutruben/72/orig 2025-12-04T09:17:11.0629966Z * [new branch] gh/coconutruben/73/base -> origin/gh/coconutruben/73/base 2025-12-04T09:17:11.0631810Z * [new branch] gh/coconutruben/73/head -> origin/gh/coconutruben/73/head 2025-12-04T09:17:11.0633697Z * [new branch] gh/coconutruben/73/orig -> origin/gh/coconutruben/73/orig 2025-12-04T09:17:11.0636364Z * [new branch] gh/coconutruben/74/base -> origin/gh/coconutruben/74/base 2025-12-04T09:17:11.0638391Z * [new branch] gh/coconutruben/74/head -> origin/gh/coconutruben/74/head 2025-12-04T09:17:11.0640214Z * [new branch] gh/coconutruben/74/orig -> origin/gh/coconutruben/74/orig 2025-12-04T09:17:11.0642921Z * [new branch] gh/coconutruben/79/base -> origin/gh/coconutruben/79/base 2025-12-04T09:17:11.0644967Z * [new branch] gh/coconutruben/79/head -> origin/gh/coconutruben/79/head 2025-12-04T09:17:11.0646641Z * [new branch] gh/coconutruben/79/orig -> origin/gh/coconutruben/79/orig 2025-12-04T09:17:11.0649396Z * [new branch] gh/coconutruben/80/base -> origin/gh/coconutruben/80/base 2025-12-04T09:17:11.0651326Z * [new branch] gh/coconutruben/80/head -> origin/gh/coconutruben/80/head 2025-12-04T09:17:11.0653553Z * [new branch] gh/coconutruben/80/orig -> origin/gh/coconutruben/80/orig 2025-12-04T09:17:11.0655799Z * [new branch] gh/coconutruben/82/base -> origin/gh/coconutruben/82/base 2025-12-04T09:17:11.0657584Z * [new branch] gh/coconutruben/82/head -> origin/gh/coconutruben/82/head 
2025-12-04T09:17:11.0659273Z * [new branch] gh/coconutruben/82/orig -> origin/gh/coconutruben/82/orig 2025-12-04T09:17:11.0662208Z * [new branch] gh/coconutruben/83/base -> origin/gh/coconutruben/83/base 2025-12-04T09:17:11.0663911Z * [new branch] gh/coconutruben/83/head -> origin/gh/coconutruben/83/head 2025-12-04T09:17:11.0665737Z * [new branch] gh/coconutruben/83/orig -> origin/gh/coconutruben/83/orig 2025-12-04T09:17:11.0668431Z * [new branch] gh/coconutruben/84/base -> origin/gh/coconutruben/84/base 2025-12-04T09:17:11.0670528Z * [new branch] gh/coconutruben/84/head -> origin/gh/coconutruben/84/head 2025-12-04T09:17:11.0672202Z * [new branch] gh/coconutruben/84/orig -> origin/gh/coconutruben/84/orig 2025-12-04T09:17:11.0674843Z * [new branch] gh/coconutruben/85/base -> origin/gh/coconutruben/85/base 2025-12-04T09:17:11.0676803Z * [new branch] gh/coconutruben/85/head -> origin/gh/coconutruben/85/head 2025-12-04T09:17:11.0678721Z * [new branch] gh/coconutruben/85/orig -> origin/gh/coconutruben/85/orig 2025-12-04T09:17:11.0681366Z * [new branch] gh/coconutruben/86/base -> origin/gh/coconutruben/86/base 2025-12-04T09:17:11.0683162Z * [new branch] gh/coconutruben/86/head -> origin/gh/coconutruben/86/head 2025-12-04T09:17:11.0685016Z * [new branch] gh/coconutruben/86/orig -> origin/gh/coconutruben/86/orig 2025-12-04T09:17:11.0688126Z * [new branch] gh/colinchan15/1/base -> origin/gh/colinchan15/1/base 2025-12-04T09:17:11.0690066Z * [new branch] gh/colinchan15/1/head -> origin/gh/colinchan15/1/head 2025-12-04T09:17:11.0692455Z * [new branch] gh/colinchan15/2/base -> origin/gh/colinchan15/2/base 2025-12-04T09:17:11.0694259Z * [new branch] gh/colinchan15/2/head -> origin/gh/colinchan15/2/head 2025-12-04T09:17:11.0696730Z * [new branch] gh/colinchan15/3/base -> origin/gh/colinchan15/3/base 2025-12-04T09:17:11.0698627Z * [new branch] gh/colinchan15/3/head -> origin/gh/colinchan15/3/head 2025-12-04T09:17:11.0700980Z * [new branch] gh/colinchan15/6/base -> 
origin/gh/colinchan15/6/base 2025-12-04T09:17:11.0703325Z * [new branch] gh/colinchan15/6/head -> origin/gh/colinchan15/6/head 2025-12-04T09:17:11.0706500Z * [new branch] gh/d4l3k/1/base -> origin/gh/d4l3k/1/base 2025-12-04T09:17:11.0708094Z * [new branch] gh/d4l3k/1/head -> origin/gh/d4l3k/1/head 2025-12-04T09:17:11.0710699Z * [new branch] gh/d4l3k/2/base -> origin/gh/d4l3k/2/base 2025-12-04T09:17:11.0712498Z * [new branch] gh/d4l3k/2/head -> origin/gh/d4l3k/2/head 2025-12-04T09:17:11.0714246Z * [new branch] gh/d4l3k/2/orig -> origin/gh/d4l3k/2/orig 2025-12-04T09:17:11.0716749Z * [new branch] gh/d4l3k/3/base -> origin/gh/d4l3k/3/base 2025-12-04T09:17:11.0718605Z * [new branch] gh/d4l3k/3/head -> origin/gh/d4l3k/3/head 2025-12-04T09:17:11.0720844Z * [new branch] gh/d4l3k/3/orig -> origin/gh/d4l3k/3/orig 2025-12-04T09:17:11.0723737Z * [new branch] gh/d4l3k/4/base -> origin/gh/d4l3k/4/base 2025-12-04T09:17:11.0725537Z * [new branch] gh/d4l3k/4/head -> origin/gh/d4l3k/4/head 2025-12-04T09:17:11.0727389Z * [new branch] gh/d4l3k/4/orig -> origin/gh/d4l3k/4/orig 2025-12-04T09:17:11.0729846Z * [new branch] gh/d4l3k/5/base -> origin/gh/d4l3k/5/base 2025-12-04T09:17:11.0731720Z * [new branch] gh/d4l3k/5/orig -> origin/gh/d4l3k/5/orig 2025-12-04T09:17:11.0735150Z * [new branch] gh/davidberard98/392/base -> origin/gh/davidberard98/392/base 2025-12-04T09:17:11.0736456Z * [new branch] gh/davidberard98/392/head -> origin/gh/davidberard98/392/head 2025-12-04T09:17:11.0738568Z * [new branch] gh/davidberard98/392/orig -> origin/gh/davidberard98/392/orig 2025-12-04T09:17:11.0741158Z * [new branch] gh/davidberard98/399/base -> origin/gh/davidberard98/399/base 2025-12-04T09:17:11.0743056Z * [new branch] gh/davidberard98/399/head -> origin/gh/davidberard98/399/head 2025-12-04T09:17:11.0744999Z * [new branch] gh/davidberard98/399/orig -> origin/gh/davidberard98/399/orig 2025-12-04T09:17:11.0748085Z * [new branch] gh/desertfire/605/base -> origin/gh/desertfire/605/base 
2025-12-04T09:17:11.0750159Z * [new branch] gh/desertfire/605/head -> origin/gh/desertfire/605/head 2025-12-04T09:17:11.0751901Z * [new branch] gh/desertfire/605/orig -> origin/gh/desertfire/605/orig 2025-12-04T09:17:11.0754422Z * [new branch] gh/desertfire/606/base -> origin/gh/desertfire/606/base 2025-12-04T09:17:11.0756234Z * [new branch] gh/desertfire/606/head -> origin/gh/desertfire/606/head 2025-12-04T09:17:11.0758195Z * [new branch] gh/desertfire/606/orig -> origin/gh/desertfire/606/orig 2025-12-04T09:17:11.0760919Z * [new branch] gh/desertfire/607/base -> origin/gh/desertfire/607/base 2025-12-04T09:17:11.0762699Z * [new branch] gh/desertfire/607/head -> origin/gh/desertfire/607/head 2025-12-04T09:17:11.0764502Z * [new branch] gh/desertfire/607/orig -> origin/gh/desertfire/607/orig 2025-12-04T09:17:11.0767101Z * [new branch] gh/desertfire/608/base -> origin/gh/desertfire/608/base 2025-12-04T09:17:11.0769087Z * [new branch] gh/desertfire/608/head -> origin/gh/desertfire/608/head 2025-12-04T09:17:11.0770919Z * [new branch] gh/desertfire/608/orig -> origin/gh/desertfire/608/orig 2025-12-04T09:17:11.0773429Z * [new branch] gh/desertfire/609/base -> origin/gh/desertfire/609/base 2025-12-04T09:17:11.0775288Z * [new branch] gh/desertfire/609/head -> origin/gh/desertfire/609/head 2025-12-04T09:17:11.0777060Z * [new branch] gh/desertfire/609/orig -> origin/gh/desertfire/609/orig 2025-12-04T09:17:11.0779781Z * [new branch] gh/desertfire/610/base -> origin/gh/desertfire/610/base 2025-12-04T09:17:11.0781686Z * [new branch] gh/desertfire/610/head -> origin/gh/desertfire/610/head 2025-12-04T09:17:11.0783573Z * [new branch] gh/desertfire/610/orig -> origin/gh/desertfire/610/orig 2025-12-04T09:17:11.0786056Z * [new branch] gh/desertfire/611/base -> origin/gh/desertfire/611/base 2025-12-04T09:17:11.0788080Z * [new branch] gh/desertfire/611/head -> origin/gh/desertfire/611/head 2025-12-04T09:17:11.0789927Z * [new branch] gh/desertfire/611/orig -> origin/gh/desertfire/611/orig 
2025-12-04T09:17:11.0792391Z * [new branch] gh/desertfire/612/base -> origin/gh/desertfire/612/base 2025-12-04T09:17:11.0794480Z * [new branch] gh/desertfire/612/head -> origin/gh/desertfire/612/head 2025-12-04T09:17:11.0796215Z * [new branch] gh/desertfire/612/orig -> origin/gh/desertfire/612/orig 2025-12-04T09:17:11.0798854Z * [new branch] gh/desertfire/613/base -> origin/gh/desertfire/613/base 2025-12-04T09:17:11.0801120Z * [new branch] gh/desertfire/613/head -> origin/gh/desertfire/613/head 2025-12-04T09:17:11.0803134Z * [new branch] gh/desertfire/613/orig -> origin/gh/desertfire/613/orig 2025-12-04T09:17:11.0805786Z * [new branch] gh/desertfire/614/base -> origin/gh/desertfire/614/base 2025-12-04T09:17:11.0807708Z * [new branch] gh/desertfire/614/head -> origin/gh/desertfire/614/head 2025-12-04T09:17:11.0809623Z * [new branch] gh/desertfire/614/orig -> origin/gh/desertfire/614/orig 2025-12-04T09:17:11.0812233Z * [new branch] gh/desertfire/615/base -> origin/gh/desertfire/615/base 2025-12-04T09:17:11.0814327Z * [new branch] gh/desertfire/615/head -> origin/gh/desertfire/615/head 2025-12-04T09:17:11.0816120Z * [new branch] gh/desertfire/615/orig -> origin/gh/desertfire/615/orig 2025-12-04T09:17:11.0818519Z * [new branch] gh/desertfire/616/base -> origin/gh/desertfire/616/base 2025-12-04T09:17:11.0820535Z * [new branch] gh/desertfire/616/head -> origin/gh/desertfire/616/head 2025-12-04T09:17:11.0822293Z * [new branch] gh/desertfire/616/orig -> origin/gh/desertfire/616/orig 2025-12-04T09:17:11.0824697Z * [new branch] gh/desertfire/617/base -> origin/gh/desertfire/617/base 2025-12-04T09:17:11.0826700Z * [new branch] gh/desertfire/617/head -> origin/gh/desertfire/617/head 2025-12-04T09:17:11.0828515Z * [new branch] gh/desertfire/617/orig -> origin/gh/desertfire/617/orig 2025-12-04T09:17:11.0831560Z * [new branch] gh/dharakk/1/base -> origin/gh/dharakk/1/base 2025-12-04T09:17:11.0833462Z * [new branch] gh/dharakk/1/head -> origin/gh/dharakk/1/head 
2025-12-04T09:17:11.0836494Z * [new branch] gh/drisspg/170/base -> origin/gh/drisspg/170/base 2025-12-04T09:17:11.0838365Z * [new branch] gh/drisspg/170/head -> origin/gh/drisspg/170/head 2025-12-04T09:17:11.0840356Z * [new branch] gh/drisspg/170/orig -> origin/gh/drisspg/170/orig 2025-12-04T09:17:11.0842910Z * [new branch] gh/drisspg/182/base -> origin/gh/drisspg/182/base 2025-12-04T09:17:11.0844745Z * [new branch] gh/drisspg/182/head -> origin/gh/drisspg/182/head 2025-12-04T09:17:11.0847123Z * [new branch] gh/drisspg/183/base -> origin/gh/drisspg/183/base 2025-12-04T09:17:11.0848871Z * [new branch] gh/drisspg/183/head -> origin/gh/drisspg/183/head 2025-12-04T09:17:11.0851262Z * [new branch] gh/drisspg/184/base -> origin/gh/drisspg/184/base 2025-12-04T09:17:11.0853003Z * [new branch] gh/drisspg/184/head -> origin/gh/drisspg/184/head 2025-12-04T09:17:11.0855553Z * [new branch] gh/drisspg/185/base -> origin/gh/drisspg/185/base 2025-12-04T09:17:11.0857377Z * [new branch] gh/drisspg/185/head -> origin/gh/drisspg/185/head 2025-12-04T09:17:11.0859922Z * [new branch] gh/drisspg/194/base -> origin/gh/drisspg/194/base 2025-12-04T09:17:11.0861745Z * [new branch] gh/drisspg/194/head -> origin/gh/drisspg/194/head 2025-12-04T09:17:11.0863623Z * [new branch] gh/drisspg/194/orig -> origin/gh/drisspg/194/orig 2025-12-04T09:17:11.0866223Z * [new branch] gh/drisspg/200/base -> origin/gh/drisspg/200/base 2025-12-04T09:17:11.0868031Z * [new branch] gh/drisspg/200/head -> origin/gh/drisspg/200/head 2025-12-04T09:17:11.0870076Z * [new branch] gh/drisspg/200/orig -> origin/gh/drisspg/200/orig 2025-12-04T09:17:11.0872390Z * [new branch] gh/drisspg/218/base -> origin/gh/drisspg/218/base 2025-12-04T09:17:11.0874157Z * [new branch] gh/drisspg/218/head -> origin/gh/drisspg/218/head 2025-12-04T09:17:11.0876021Z * [new branch] gh/drisspg/218/orig -> origin/gh/drisspg/218/orig 2025-12-04T09:17:11.0878589Z * [new branch] gh/drisspg/219/base -> origin/gh/drisspg/219/base 
2025-12-04T09:17:11.0880596Z * [new branch] gh/drisspg/219/head -> origin/gh/drisspg/219/head 2025-12-04T09:17:11.0882519Z * [new branch] gh/drisspg/219/orig -> origin/gh/drisspg/219/orig 2025-12-04T09:17:11.0885019Z * [new branch] gh/drisspg/220/base -> origin/gh/drisspg/220/base 2025-12-04T09:17:11.0886855Z * [new branch] gh/drisspg/220/head -> origin/gh/drisspg/220/head 2025-12-04T09:17:11.0888749Z * [new branch] gh/drisspg/220/orig -> origin/gh/drisspg/220/orig 2025-12-04T09:17:11.0891338Z * [new branch] gh/drisspg/221/base -> origin/gh/drisspg/221/base 2025-12-04T09:17:11.0893210Z * [new branch] gh/drisspg/221/head -> origin/gh/drisspg/221/head 2025-12-04T09:17:11.0895043Z * [new branch] gh/drisspg/221/orig -> origin/gh/drisspg/221/orig 2025-12-04T09:17:11.0897527Z * [new branch] gh/drisspg/222/base -> origin/gh/drisspg/222/base 2025-12-04T09:17:11.0899393Z * [new branch] gh/drisspg/222/head -> origin/gh/drisspg/222/head 2025-12-04T09:17:11.0901271Z * [new branch] gh/drisspg/222/orig -> origin/gh/drisspg/222/orig 2025-12-04T09:17:11.0904244Z * [new branch] gh/drisspg/223/base -> origin/gh/drisspg/223/base 2025-12-04T09:17:11.0906084Z * [new branch] gh/drisspg/223/head -> origin/gh/drisspg/223/head 2025-12-04T09:17:11.0907941Z * [new branch] gh/drisspg/223/orig -> origin/gh/drisspg/223/orig 2025-12-04T09:17:11.0910487Z * [new branch] gh/drisspg/224/base -> origin/gh/drisspg/224/base 2025-12-04T09:17:11.0912288Z * [new branch] gh/drisspg/224/head -> origin/gh/drisspg/224/head 2025-12-04T09:17:11.0914103Z * [new branch] gh/drisspg/224/orig -> origin/gh/drisspg/224/orig 2025-12-04T09:17:11.0916714Z * [new branch] gh/drisspg/225/base -> origin/gh/drisspg/225/base 2025-12-04T09:17:11.0918598Z * [new branch] gh/drisspg/225/head -> origin/gh/drisspg/225/head 2025-12-04T09:17:11.0920475Z * [new branch] gh/drisspg/225/orig -> origin/gh/drisspg/225/orig 2025-12-04T09:17:11.0923070Z * [new branch] gh/drisspg/226/base -> origin/gh/drisspg/226/base 
2025-12-04T09:17:11.0924844Z * [new branch] gh/drisspg/226/head -> origin/gh/drisspg/226/head 2025-12-04T09:17:11.0926672Z * [new branch] gh/drisspg/226/orig -> origin/gh/drisspg/226/orig 2025-12-04T09:17:11.0929811Z * [new branch] gh/drisspg/227/base -> origin/gh/drisspg/227/base 2025-12-04T09:17:11.0931636Z * [new branch] gh/drisspg/227/head -> origin/gh/drisspg/227/head 2025-12-04T09:17:11.0933413Z * [new branch] gh/drisspg/227/orig -> origin/gh/drisspg/227/orig 2025-12-04T09:17:11.0936026Z * [new branch] gh/drisspg/228/base -> origin/gh/drisspg/228/base 2025-12-04T09:17:11.0937845Z * [new branch] gh/drisspg/228/head -> origin/gh/drisspg/228/head 2025-12-04T09:17:11.0939688Z * [new branch] gh/drisspg/228/orig -> origin/gh/drisspg/228/orig 2025-12-04T09:17:11.0942446Z * [new branch] gh/drisspg/229/base -> origin/gh/drisspg/229/base 2025-12-04T09:17:11.0944331Z * [new branch] gh/drisspg/229/head -> origin/gh/drisspg/229/head 2025-12-04T09:17:11.0946292Z * [new branch] gh/drisspg/229/orig -> origin/gh/drisspg/229/orig 2025-12-04T09:17:11.0948868Z * [new branch] gh/drisspg/230/base -> origin/gh/drisspg/230/base 2025-12-04T09:17:11.0950677Z * [new branch] gh/drisspg/230/head -> origin/gh/drisspg/230/head 2025-12-04T09:17:11.0952316Z * [new branch] gh/drisspg/230/orig -> origin/gh/drisspg/230/orig 2025-12-04T09:17:11.0955583Z * [new branch] gh/dsjohns2/1/base -> origin/gh/dsjohns2/1/base 2025-12-04T09:17:11.0957500Z * [new branch] gh/dsjohns2/1/head -> origin/gh/dsjohns2/1/head 2025-12-04T09:17:11.0961446Z * [new branch] gh/dzmitry-huba/1/base -> origin/gh/dzmitry-huba/1/base 2025-12-04T09:17:11.0963305Z * [new branch] gh/dzmitry-huba/1/head -> origin/gh/dzmitry-huba/1/head 2025-12-04T09:17:11.0966156Z * [new branch] gh/dzmitry-huba/12/base -> origin/gh/dzmitry-huba/12/base 2025-12-04T09:17:11.0968072Z * [new branch] gh/dzmitry-huba/12/head -> origin/gh/dzmitry-huba/12/head 2025-12-04T09:17:11.0969864Z * [new branch] gh/dzmitry-huba/12/orig -> 
origin/gh/dzmitry-huba/12/orig 2025-12-04T09:17:11.0972567Z * [new branch] gh/dzmitry-huba/13/base -> origin/gh/dzmitry-huba/13/base 2025-12-04T09:17:11.0974456Z * [new branch] gh/dzmitry-huba/13/head -> origin/gh/dzmitry-huba/13/head 2025-12-04T09:17:11.0976293Z * [new branch] gh/dzmitry-huba/13/orig -> origin/gh/dzmitry-huba/13/orig 2025-12-04T09:17:11.0978785Z * [new branch] gh/dzmitry-huba/14/base -> origin/gh/dzmitry-huba/14/base 2025-12-04T09:17:11.0980664Z * [new branch] gh/dzmitry-huba/14/head -> origin/gh/dzmitry-huba/14/head 2025-12-04T09:17:11.0982497Z * [new branch] gh/dzmitry-huba/14/orig -> origin/gh/dzmitry-huba/14/orig 2025-12-04T09:17:11.0985096Z * [new branch] gh/dzmitry-huba/15/base -> origin/gh/dzmitry-huba/15/base 2025-12-04T09:17:11.0986958Z * [new branch] gh/dzmitry-huba/15/head -> origin/gh/dzmitry-huba/15/head 2025-12-04T09:17:11.0988785Z * [new branch] gh/dzmitry-huba/15/orig -> origin/gh/dzmitry-huba/15/orig 2025-12-04T09:17:11.0991606Z * [new branch] gh/dzmitry-huba/16/base -> origin/gh/dzmitry-huba/16/base 2025-12-04T09:17:11.0993524Z * [new branch] gh/dzmitry-huba/16/head -> origin/gh/dzmitry-huba/16/head 2025-12-04T09:17:11.0995397Z * [new branch] gh/dzmitry-huba/16/orig -> origin/gh/dzmitry-huba/16/orig 2025-12-04T09:17:11.0997969Z * [new branch] gh/dzmitry-huba/17/base -> origin/gh/dzmitry-huba/17/base 2025-12-04T09:17:11.0999747Z * [new branch] gh/dzmitry-huba/17/head -> origin/gh/dzmitry-huba/17/head 2025-12-04T09:17:11.1002079Z * [new branch] gh/dzmitry-huba/17/orig -> origin/gh/dzmitry-huba/17/orig 2025-12-04T09:17:11.1004534Z * [new branch] gh/dzmitry-huba/2/base -> origin/gh/dzmitry-huba/2/base 2025-12-04T09:17:11.1006389Z * [new branch] gh/dzmitry-huba/2/head -> origin/gh/dzmitry-huba/2/head 2025-12-04T09:17:11.1008832Z * [new branch] gh/dzmitry-huba/3/base -> origin/gh/dzmitry-huba/3/base 2025-12-04T09:17:11.1010592Z * [new branch] gh/dzmitry-huba/3/head -> origin/gh/dzmitry-huba/3/head 2025-12-04T09:17:11.1013817Z * [new 
branch] gh/eellison/808/base -> origin/gh/eellison/808/base 2025-12-04T09:17:11.1015796Z * [new branch] gh/eellison/808/head -> origin/gh/eellison/808/head 2025-12-04T09:17:11.1017612Z * [new branch] gh/eellison/808/orig -> origin/gh/eellison/808/orig 2025-12-04T09:17:11.1020472Z * [new branch] gh/eellison/822/base -> origin/gh/eellison/822/base 2025-12-04T09:17:11.1022449Z * [new branch] gh/eellison/822/head -> origin/gh/eellison/822/head 2025-12-04T09:17:11.1024209Z * [new branch] gh/eellison/822/orig -> origin/gh/eellison/822/orig 2025-12-04T09:17:11.1026712Z * [new branch] gh/eellison/823/base -> origin/gh/eellison/823/base 2025-12-04T09:17:11.1028640Z * [new branch] gh/eellison/823/head -> origin/gh/eellison/823/head 2025-12-04T09:17:11.1030424Z * [new branch] gh/eellison/823/orig -> origin/gh/eellison/823/orig 2025-12-04T09:17:11.1032910Z * [new branch] gh/eellison/862/base -> origin/gh/eellison/862/base 2025-12-04T09:17:11.1034758Z * [new branch] gh/eellison/862/head -> origin/gh/eellison/862/head 2025-12-04T09:17:11.1036536Z * [new branch] gh/eellison/862/orig -> origin/gh/eellison/862/orig 2025-12-04T09:17:11.1039121Z * [new branch] gh/eellison/863/base -> origin/gh/eellison/863/base 2025-12-04T09:17:11.1041247Z * [new branch] gh/eellison/863/head -> origin/gh/eellison/863/head 2025-12-04T09:17:11.1043087Z * [new branch] gh/eellison/863/orig -> origin/gh/eellison/863/orig 2025-12-04T09:17:11.1045457Z * [new branch] gh/eellison/864/base -> origin/gh/eellison/864/base 2025-12-04T09:17:11.1047374Z * [new branch] gh/eellison/864/head -> origin/gh/eellison/864/head 2025-12-04T09:17:11.1049836Z * [new branch] gh/eellison/864/orig -> origin/gh/eellison/864/orig 2025-12-04T09:17:11.1052409Z * [new branch] gh/eellison/865/base -> origin/gh/eellison/865/base 2025-12-04T09:17:11.1053915Z * [new branch] gh/eellison/865/head -> origin/gh/eellison/865/head 2025-12-04T09:17:11.1055774Z * [new branch] gh/eellison/865/orig -> origin/gh/eellison/865/orig 
2025-12-04T09:17:11.1058311Z * [new branch] gh/eellison/866/base -> origin/gh/eellison/866/base 2025-12-04T09:17:11.1060111Z * [new branch] gh/eellison/866/head -> origin/gh/eellison/866/head 2025-12-04T09:17:11.1061928Z * [new branch] gh/eellison/866/orig -> origin/gh/eellison/866/orig 2025-12-04T09:17:11.1064607Z * [new branch] gh/eellison/867/base -> origin/gh/eellison/867/base 2025-12-04T09:17:11.1066472Z * [new branch] gh/eellison/867/head -> origin/gh/eellison/867/head 2025-12-04T09:17:11.1068393Z * [new branch] gh/eellison/867/orig -> origin/gh/eellison/867/orig 2025-12-04T09:17:11.1071107Z * [new branch] gh/eellison/868/base -> origin/gh/eellison/868/base 2025-12-04T09:17:11.1073213Z * [new branch] gh/eellison/868/head -> origin/gh/eellison/868/head 2025-12-04T09:17:11.1074931Z * [new branch] gh/eellison/868/orig -> origin/gh/eellison/868/orig 2025-12-04T09:17:11.1077483Z * [new branch] gh/eellison/869/base -> origin/gh/eellison/869/base 2025-12-04T09:17:11.1079476Z * [new branch] gh/eellison/869/head -> origin/gh/eellison/869/head 2025-12-04T09:17:11.1081415Z * [new branch] gh/eellison/869/orig -> origin/gh/eellison/869/orig 2025-12-04T09:17:11.1083952Z * [new branch] gh/eellison/870/base -> origin/gh/eellison/870/base 2025-12-04T09:17:11.1085753Z * [new branch] gh/eellison/870/head -> origin/gh/eellison/870/head 2025-12-04T09:17:11.1087552Z * [new branch] gh/eellison/870/orig -> origin/gh/eellison/870/orig 2025-12-04T09:17:11.1090647Z * [new branch] gh/eellison/871/base -> origin/gh/eellison/871/base 2025-12-04T09:17:11.1092049Z * [new branch] gh/eellison/871/head -> origin/gh/eellison/871/head 2025-12-04T09:17:11.1093862Z * [new branch] gh/eellison/871/orig -> origin/gh/eellison/871/orig 2025-12-04T09:17:11.1096561Z * [new branch] gh/eellison/872/base -> origin/gh/eellison/872/base 2025-12-04T09:17:11.1098375Z * [new branch] gh/eellison/872/head -> origin/gh/eellison/872/head 2025-12-04T09:17:11.1100175Z * [new branch] gh/eellison/872/orig -> 
origin/gh/eellison/872/orig 2025-12-04T09:17:11.1105827Z * [new branch] gh/eellison/873/base -> origin/gh/eellison/873/base 2025-12-04T09:17:11.1107603Z * [new branch] gh/eellison/873/head -> origin/gh/eellison/873/head 2025-12-04T09:17:11.1109531Z * [new branch] gh/eellison/873/orig -> origin/gh/eellison/873/orig 2025-12-04T09:17:11.1112060Z * [new branch] gh/eellison/874/base -> origin/gh/eellison/874/base 2025-12-04T09:17:11.1113859Z * [new branch] gh/eellison/874/head -> origin/gh/eellison/874/head 2025-12-04T09:17:11.1115683Z * [new branch] gh/eellison/874/orig -> origin/gh/eellison/874/orig 2025-12-04T09:17:11.1118828Z * [new branch] gh/eellison/875/base -> origin/gh/eellison/875/base 2025-12-04T09:17:11.1120986Z * [new branch] gh/eellison/875/head -> origin/gh/eellison/875/head 2025-12-04T09:17:11.1122832Z * [new branch] gh/eellison/875/orig -> origin/gh/eellison/875/orig 2025-12-04T09:17:11.1125495Z * [new branch] gh/eellison/876/base -> origin/gh/eellison/876/base 2025-12-04T09:17:11.1127378Z * [new branch] gh/eellison/876/head -> origin/gh/eellison/876/head 2025-12-04T09:17:11.1129250Z * [new branch] gh/eellison/876/orig -> origin/gh/eellison/876/orig 2025-12-04T09:17:11.1131856Z * [new branch] gh/eellison/877/base -> origin/gh/eellison/877/base 2025-12-04T09:17:11.1133724Z * [new branch] gh/eellison/877/head -> origin/gh/eellison/877/head 2025-12-04T09:17:11.1135543Z * [new branch] gh/eellison/877/orig -> origin/gh/eellison/877/orig 2025-12-04T09:17:11.1138175Z * [new branch] gh/eellison/878/base -> origin/gh/eellison/878/base 2025-12-04T09:17:11.1140036Z * [new branch] gh/eellison/878/head -> origin/gh/eellison/878/head 2025-12-04T09:17:11.1141871Z * [new branch] gh/eellison/878/orig -> origin/gh/eellison/878/orig 2025-12-04T09:17:11.1144506Z * [new branch] gh/eellison/879/base -> origin/gh/eellison/879/base 2025-12-04T09:17:11.1146446Z * [new branch] gh/eellison/879/head -> origin/gh/eellison/879/head 2025-12-04T09:17:11.1148436Z * [new branch] 
gh/eellison/879/orig -> origin/gh/eellison/879/orig 2025-12-04T09:17:11.1150874Z * [new branch] gh/eellison/880/base -> origin/gh/eellison/880/base 2025-12-04T09:17:11.1152771Z * [new branch] gh/eellison/880/head -> origin/gh/eellison/880/head 2025-12-04T09:17:11.1154615Z * [new branch] gh/eellison/880/orig -> origin/gh/eellison/880/orig 2025-12-04T09:17:11.1157279Z * [new branch] gh/eellison/881/base -> origin/gh/eellison/881/base 2025-12-04T09:17:11.1159165Z * [new branch] gh/eellison/881/head -> origin/gh/eellison/881/head 2025-12-04T09:17:11.1161104Z * [new branch] gh/eellison/881/orig -> origin/gh/eellison/881/orig 2025-12-04T09:17:11.1163695Z * [new branch] gh/eellison/882/base -> origin/gh/eellison/882/base 2025-12-04T09:17:11.1165435Z * [new branch] gh/eellison/882/head -> origin/gh/eellison/882/head 2025-12-04T09:17:11.1167451Z * [new branch] gh/eellison/882/orig -> origin/gh/eellison/882/orig 2025-12-04T09:17:11.1170008Z * [new branch] gh/eellison/883/base -> origin/gh/eellison/883/base 2025-12-04T09:17:11.1171968Z * [new branch] gh/eellison/883/head -> origin/gh/eellison/883/head 2025-12-04T09:17:11.1173908Z * [new branch] gh/eellison/883/orig -> origin/gh/eellison/883/orig 2025-12-04T09:17:11.1176216Z * [new branch] gh/eellison/884/base -> origin/gh/eellison/884/base 2025-12-04T09:17:11.1178071Z * [new branch] gh/eellison/884/head -> origin/gh/eellison/884/head 2025-12-04T09:17:11.1179827Z * [new branch] gh/eellison/884/orig -> origin/gh/eellison/884/orig 2025-12-04T09:17:11.1182914Z * [new branch] gh/etaf/147/base -> origin/gh/etaf/147/base 2025-12-04T09:17:11.1184889Z * [new branch] gh/etaf/147/head -> origin/gh/etaf/147/head 2025-12-04T09:17:11.1187569Z * [new branch] gh/etaf/154/base -> origin/gh/etaf/154/base 2025-12-04T09:17:11.1189508Z * [new branch] gh/etaf/154/head -> origin/gh/etaf/154/head 2025-12-04T09:17:11.1191285Z * [new branch] gh/etaf/154/orig -> origin/gh/etaf/154/orig 2025-12-04T09:17:11.1193781Z * [new branch] gh/etaf/156/base -> 
origin/gh/etaf/156/base 2025-12-04T09:17:11.1195774Z * [new branch] gh/etaf/156/head -> origin/gh/etaf/156/head 2025-12-04T09:17:11.1197524Z * [new branch] gh/etaf/156/orig -> origin/gh/etaf/156/orig 2025-12-04T09:17:11.1200757Z * [new branch] gh/etaf/157/base -> origin/gh/etaf/157/base 2025-12-04T09:17:11.1202728Z * [new branch] gh/etaf/157/head -> origin/gh/etaf/157/head 2025-12-04T09:17:11.1204540Z * [new branch] gh/etaf/157/orig -> origin/gh/etaf/157/orig 2025-12-04T09:17:11.1207026Z * [new branch] gh/etaf/158/base -> origin/gh/etaf/158/base 2025-12-04T09:17:11.1209053Z * [new branch] gh/etaf/158/head -> origin/gh/etaf/158/head 2025-12-04T09:17:11.1210826Z * [new branch] gh/etaf/158/orig -> origin/gh/etaf/158/orig 2025-12-04T09:17:11.1213383Z * [new branch] gh/etaf/159/base -> origin/gh/etaf/159/base 2025-12-04T09:17:11.1215402Z * [new branch] gh/etaf/159/head -> origin/gh/etaf/159/head 2025-12-04T09:17:11.1217192Z * [new branch] gh/etaf/159/orig -> origin/gh/etaf/159/orig 2025-12-04T09:17:11.1220656Z * [new branch] gh/etaf/160/base -> origin/gh/etaf/160/base 2025-12-04T09:17:11.1223291Z * [new branch] gh/etaf/160/head -> origin/gh/etaf/160/head 2025-12-04T09:17:11.1225179Z * [new branch] gh/etaf/160/orig -> origin/gh/etaf/160/orig 2025-12-04T09:17:11.1227711Z * [new branch] gh/etaf/161/base -> origin/gh/etaf/161/base 2025-12-04T09:17:11.1229726Z * [new branch] gh/etaf/161/head -> origin/gh/etaf/161/head 2025-12-04T09:17:11.1231553Z * [new branch] gh/etaf/161/orig -> origin/gh/etaf/161/orig 2025-12-04T09:17:11.1234062Z * [new branch] gh/etaf/166/base -> origin/gh/etaf/166/base 2025-12-04T09:17:11.1236056Z * [new branch] gh/etaf/166/head -> origin/gh/etaf/166/head 2025-12-04T09:17:11.1237903Z * [new branch] gh/etaf/166/orig -> origin/gh/etaf/166/orig 2025-12-04T09:17:11.1240688Z * [new branch] gh/etaf/167/base -> origin/gh/etaf/167/base 2025-12-04T09:17:11.1242414Z * [new branch] gh/etaf/167/head -> origin/gh/etaf/167/head 2025-12-04T09:17:11.1244211Z * [new 
branch] gh/etaf/167/orig -> origin/gh/etaf/167/orig 2025-12-04T09:17:11.1246872Z * [new branch] gh/etaf/168/base -> origin/gh/etaf/168/base 2025-12-04T09:17:11.1248975Z * [new branch] gh/etaf/168/head -> origin/gh/etaf/168/head 2025-12-04T09:17:11.1250796Z * [new branch] gh/etaf/168/orig -> origin/gh/etaf/168/orig 2025-12-04T09:17:11.1253512Z * [new branch] gh/etaf/172/base -> origin/gh/etaf/172/base 2025-12-04T09:17:11.1255273Z * [new branch] gh/etaf/172/head -> origin/gh/etaf/172/head 2025-12-04T09:17:11.1257104Z * [new branch] gh/etaf/172/orig -> origin/gh/etaf/172/orig 2025-12-04T09:17:11.1259857Z * [new branch] gh/etaf/173/base -> origin/gh/etaf/173/base 2025-12-04T09:17:11.1261835Z * [new branch] gh/etaf/173/head -> origin/gh/etaf/173/head 2025-12-04T09:17:11.1263708Z * [new branch] gh/etaf/173/orig -> origin/gh/etaf/173/orig 2025-12-04T09:17:11.1266295Z * [new branch] gh/etaf/174/base -> origin/gh/etaf/174/base 2025-12-04T09:17:11.1268163Z * [new branch] gh/etaf/174/head -> origin/gh/etaf/174/head 2025-12-04T09:17:11.1270758Z * [new branch] gh/etaf/175/base -> origin/gh/etaf/175/base 2025-12-04T09:17:11.1272531Z * [new branch] gh/etaf/175/head -> origin/gh/etaf/175/head 2025-12-04T09:17:11.1274344Z * [new branch] gh/etaf/175/orig -> origin/gh/etaf/175/orig 2025-12-04T09:17:11.1276921Z * [new branch] gh/etaf/176/base -> origin/gh/etaf/176/base 2025-12-04T09:17:11.1278816Z * [new branch] gh/etaf/176/head -> origin/gh/etaf/176/head 2025-12-04T09:17:11.1280933Z * [new branch] gh/etaf/176/orig -> origin/gh/etaf/176/orig 2025-12-04T09:17:11.1284513Z * [new branch] gh/etaf/177/base -> origin/gh/etaf/177/base 2025-12-04T09:17:11.1286470Z * [new branch] gh/etaf/177/head -> origin/gh/etaf/177/head 2025-12-04T09:17:11.1288511Z * [new branch] gh/etaf/177/orig -> origin/gh/etaf/177/orig 2025-12-04T09:17:11.1291123Z * [new branch] gh/etaf/178/base -> origin/gh/etaf/178/base 2025-12-04T09:17:11.1293163Z * [new branch] gh/etaf/178/head -> origin/gh/etaf/178/head 
2025-12-04T09:17:11.1294940Z * [new branch] gh/etaf/178/orig -> origin/gh/etaf/178/orig 2025-12-04T09:17:11.1297560Z * [new branch] gh/etaf/179/base -> origin/gh/etaf/179/base 2025-12-04T09:17:11.1299442Z * [new branch] gh/etaf/179/head -> origin/gh/etaf/179/head 2025-12-04T09:17:11.1305444Z * [new branch] gh/etaf/179/orig -> origin/gh/etaf/179/orig 2025-12-04T09:17:11.1307945Z * [new branch] gh/etaf/180/base -> origin/gh/etaf/180/base 2025-12-04T09:17:11.1309899Z * [new branch] gh/etaf/180/head -> origin/gh/etaf/180/head 2025-12-04T09:17:11.1311745Z * [new branch] gh/etaf/180/orig -> origin/gh/etaf/180/orig 2025-12-04T09:17:11.1315431Z * [new branch] gh/exclamaforte/1/base -> origin/gh/exclamaforte/1/base 2025-12-04T09:17:11.1317213Z * [new branch] gh/exclamaforte/1/head -> origin/gh/exclamaforte/1/head 2025-12-04T09:17:11.1319756Z * [new branch] gh/exclamaforte/2/base -> origin/gh/exclamaforte/2/base 2025-12-04T09:17:11.1321602Z * [new branch] gh/exclamaforte/2/head -> origin/gh/exclamaforte/2/head 2025-12-04T09:17:11.1324100Z * [new branch] gh/exclamaforte/3/base -> origin/gh/exclamaforte/3/base 2025-12-04T09:17:11.1325977Z * [new branch] gh/exclamaforte/3/head -> origin/gh/exclamaforte/3/head 2025-12-04T09:17:11.1328577Z * [new branch] gh/exclamaforte/4/base -> origin/gh/exclamaforte/4/base 2025-12-04T09:17:11.1330428Z * [new branch] gh/exclamaforte/4/head -> origin/gh/exclamaforte/4/head 2025-12-04T09:17:11.1333595Z * [new branch] gh/ezyang/2374/base -> origin/gh/ezyang/2374/base 2025-12-04T09:17:11.1335536Z * [new branch] gh/ezyang/2374/head -> origin/gh/ezyang/2374/head 2025-12-04T09:17:11.1337499Z * [new branch] gh/ezyang/2374/orig -> origin/gh/ezyang/2374/orig 2025-12-04T09:17:11.1339961Z * [new branch] gh/ezyang/2973/base -> origin/gh/ezyang/2973/base 2025-12-04T09:17:11.1341783Z * [new branch] gh/ezyang/2973/head -> origin/gh/ezyang/2973/head 2025-12-04T09:17:11.1343652Z * [new branch] gh/ezyang/2973/orig -> origin/gh/ezyang/2973/orig 
2025-12-04T09:17:11.1346117Z * [new branch] gh/ezyang/2974/base -> origin/gh/ezyang/2974/base 2025-12-04T09:17:11.1347891Z * [new branch] gh/ezyang/2974/head -> origin/gh/ezyang/2974/head 2025-12-04T09:17:11.1349960Z * [new branch] gh/ezyang/2974/orig -> origin/gh/ezyang/2974/orig 2025-12-04T09:17:11.1352576Z * [new branch] gh/ezyang/3131/base -> origin/gh/ezyang/3131/base 2025-12-04T09:17:11.1354340Z * [new branch] gh/ezyang/3131/head -> origin/gh/ezyang/3131/head 2025-12-04T09:17:11.1356197Z * [new branch] gh/ezyang/3131/orig -> origin/gh/ezyang/3131/orig 2025-12-04T09:17:11.1358745Z * [new branch] gh/ezyang/3139/base -> origin/gh/ezyang/3139/base 2025-12-04T09:17:11.1360749Z * [new branch] gh/ezyang/3139/head -> origin/gh/ezyang/3139/head 2025-12-04T09:17:11.1362573Z * [new branch] gh/ezyang/3139/orig -> origin/gh/ezyang/3139/orig 2025-12-04T09:17:11.1365034Z * [new branch] gh/ezyang/3140/base -> origin/gh/ezyang/3140/base 2025-12-04T09:17:11.1366881Z * [new branch] gh/ezyang/3140/head -> origin/gh/ezyang/3140/head 2025-12-04T09:17:11.1368793Z * [new branch] gh/ezyang/3140/orig -> origin/gh/ezyang/3140/orig 2025-12-04T09:17:11.1371330Z * [new branch] gh/ezyang/3143/base -> origin/gh/ezyang/3143/base 2025-12-04T09:17:11.1373136Z * [new branch] gh/ezyang/3143/head -> origin/gh/ezyang/3143/head 2025-12-04T09:17:11.1375035Z * [new branch] gh/ezyang/3143/orig -> origin/gh/ezyang/3143/orig 2025-12-04T09:17:11.1377645Z * [new branch] gh/ezyang/3144/base -> origin/gh/ezyang/3144/base 2025-12-04T09:17:11.1379580Z * [new branch] gh/ezyang/3144/head -> origin/gh/ezyang/3144/head 2025-12-04T09:17:11.1381454Z * [new branch] gh/ezyang/3144/orig -> origin/gh/ezyang/3144/orig 2025-12-04T09:17:11.1383965Z * [new branch] gh/ezyang/3167/base -> origin/gh/ezyang/3167/base 2025-12-04T09:17:11.1385786Z * [new branch] gh/ezyang/3167/head -> origin/gh/ezyang/3167/head 2025-12-04T09:17:11.1387909Z * [new branch] gh/ezyang/3167/orig -> origin/gh/ezyang/3167/orig 
2025-12-04T09:17:11.1390592Z * [new branch] gh/ezyang/3173/base -> origin/gh/ezyang/3173/base 2025-12-04T09:17:11.1392301Z * [new branch] gh/ezyang/3173/head -> origin/gh/ezyang/3173/head 2025-12-04T09:17:11.1394137Z * [new branch] gh/ezyang/3173/orig -> origin/gh/ezyang/3173/orig 2025-12-04T09:17:11.1396665Z * [new branch] gh/ezyang/3175/base -> origin/gh/ezyang/3175/base 2025-12-04T09:17:11.1398597Z * [new branch] gh/ezyang/3175/head -> origin/gh/ezyang/3175/head 2025-12-04T09:17:11.1400680Z * [new branch] gh/ezyang/3175/orig -> origin/gh/ezyang/3175/orig 2025-12-04T09:17:11.1405321Z * [new branch] gh/ezyang/3182/base -> origin/gh/ezyang/3182/base 2025-12-04T09:17:11.1407127Z * [new branch] gh/ezyang/3182/head -> origin/gh/ezyang/3182/head 2025-12-04T09:17:11.1409030Z * [new branch] gh/ezyang/3182/orig -> origin/gh/ezyang/3182/orig 2025-12-04T09:17:11.1411522Z * [new branch] gh/ezyang/3185/base -> origin/gh/ezyang/3185/base 2025-12-04T09:17:11.1413512Z * [new branch] gh/ezyang/3185/head -> origin/gh/ezyang/3185/head 2025-12-04T09:17:11.1415241Z * [new branch] gh/ezyang/3185/orig -> origin/gh/ezyang/3185/orig 2025-12-04T09:17:11.1417752Z * [new branch] gh/ezyang/3189/base -> origin/gh/ezyang/3189/base 2025-12-04T09:17:11.1419567Z * [new branch] gh/ezyang/3189/head -> origin/gh/ezyang/3189/head 2025-12-04T09:17:11.1421381Z * [new branch] gh/ezyang/3189/orig -> origin/gh/ezyang/3189/orig 2025-12-04T09:17:11.1423943Z * [new branch] gh/ezyang/3191/base -> origin/gh/ezyang/3191/base 2025-12-04T09:17:11.1425792Z * [new branch] gh/ezyang/3191/head -> origin/gh/ezyang/3191/head 2025-12-04T09:17:11.1427607Z * [new branch] gh/ezyang/3191/orig -> origin/gh/ezyang/3191/orig 2025-12-04T09:17:11.1430894Z * [new branch] gh/ezyang/3192/base -> origin/gh/ezyang/3192/base 2025-12-04T09:17:11.1432741Z * [new branch] gh/ezyang/3192/head -> origin/gh/ezyang/3192/head 2025-12-04T09:17:11.1434641Z * [new branch] gh/ezyang/3192/orig -> origin/gh/ezyang/3192/orig 
2025-12-04T09:17:11.1437211Z * [new branch] gh/ezyang/3193/base -> origin/gh/ezyang/3193/base 2025-12-04T09:17:11.1439127Z * [new branch] gh/ezyang/3193/head -> origin/gh/ezyang/3193/head 2025-12-04T09:17:11.1441076Z * [new branch] gh/ezyang/3193/orig -> origin/gh/ezyang/3193/orig 2025-12-04T09:17:11.1443689Z * [new branch] gh/ezyang/3194/base -> origin/gh/ezyang/3194/base 2025-12-04T09:17:11.1445455Z * [new branch] gh/ezyang/3194/head -> origin/gh/ezyang/3194/head 2025-12-04T09:17:11.1447237Z * [new branch] gh/ezyang/3194/orig -> origin/gh/ezyang/3194/orig 2025-12-04T09:17:11.1449898Z * [new branch] gh/ezyang/3195/base -> origin/gh/ezyang/3195/base 2025-12-04T09:17:11.1451732Z * [new branch] gh/ezyang/3195/head -> origin/gh/ezyang/3195/head 2025-12-04T09:17:11.1453605Z * [new branch] gh/ezyang/3195/orig -> origin/gh/ezyang/3195/orig 2025-12-04T09:17:11.1456330Z * [new branch] gh/ezyang/3196/base -> origin/gh/ezyang/3196/base 2025-12-04T09:17:11.1458205Z * [new branch] gh/ezyang/3196/head -> origin/gh/ezyang/3196/head 2025-12-04T09:17:11.1460010Z * [new branch] gh/ezyang/3196/orig -> origin/gh/ezyang/3196/orig 2025-12-04T09:17:11.1462651Z * [new branch] gh/ezyang/3197/base -> origin/gh/ezyang/3197/base 2025-12-04T09:17:11.1464477Z * [new branch] gh/ezyang/3197/head -> origin/gh/ezyang/3197/head 2025-12-04T09:17:11.1466303Z * [new branch] gh/ezyang/3197/orig -> origin/gh/ezyang/3197/orig 2025-12-04T09:17:11.1468936Z * [new branch] gh/ezyang/3198/base -> origin/gh/ezyang/3198/base 2025-12-04T09:17:11.1470732Z * [new branch] gh/ezyang/3198/head -> origin/gh/ezyang/3198/head 2025-12-04T09:17:11.1472562Z * [new branch] gh/ezyang/3198/orig -> origin/gh/ezyang/3198/orig 2025-12-04T09:17:11.1475122Z * [new branch] gh/ezyang/3199/base -> origin/gh/ezyang/3199/base 2025-12-04T09:17:11.1476984Z * [new branch] gh/ezyang/3199/head -> origin/gh/ezyang/3199/head 2025-12-04T09:17:11.1478885Z * [new branch] gh/ezyang/3199/orig -> origin/gh/ezyang/3199/orig 
2025-12-04T09:17:11.1481748Z * [new branch] gh/ezyang/3200/base -> origin/gh/ezyang/3200/base 2025-12-04T09:17:11.1483553Z * [new branch] gh/ezyang/3200/head -> origin/gh/ezyang/3200/head 2025-12-04T09:17:11.1485411Z * [new branch] gh/ezyang/3200/orig -> origin/gh/ezyang/3200/orig 2025-12-04T09:17:11.1487912Z * [new branch] gh/ezyang/3201/base -> origin/gh/ezyang/3201/base 2025-12-04T09:17:11.1490229Z * [new branch] gh/ezyang/3201/head -> origin/gh/ezyang/3201/head 2025-12-04T09:17:11.1491731Z * [new branch] gh/ezyang/3201/orig -> origin/gh/ezyang/3201/orig 2025-12-04T09:17:11.1494239Z * [new branch] gh/ezyang/3202/base -> origin/gh/ezyang/3202/base 2025-12-04T09:17:11.1496156Z * [new branch] gh/ezyang/3202/head -> origin/gh/ezyang/3202/head 2025-12-04T09:17:11.1497947Z * [new branch] gh/ezyang/3202/orig -> origin/gh/ezyang/3202/orig 2025-12-04T09:17:11.1500980Z * [new branch] gh/ezyang/3203/base -> origin/gh/ezyang/3203/base 2025-12-04T09:17:11.1502803Z * [new branch] gh/ezyang/3203/head -> origin/gh/ezyang/3203/head 2025-12-04T09:17:11.1504768Z * [new branch] gh/ezyang/3203/orig -> origin/gh/ezyang/3203/orig 2025-12-04T09:17:11.1507460Z * [new branch] gh/ezyang/3204/base -> origin/gh/ezyang/3204/base 2025-12-04T09:17:11.1509302Z * [new branch] gh/ezyang/3204/head -> origin/gh/ezyang/3204/head 2025-12-04T09:17:11.1511201Z * [new branch] gh/ezyang/3204/orig -> origin/gh/ezyang/3204/orig 2025-12-04T09:17:11.1513796Z * [new branch] gh/ezyang/3205/base -> origin/gh/ezyang/3205/base 2025-12-04T09:17:11.1515588Z * [new branch] gh/ezyang/3205/head -> origin/gh/ezyang/3205/head 2025-12-04T09:17:11.1517398Z * [new branch] gh/ezyang/3205/orig -> origin/gh/ezyang/3205/orig 2025-12-04T09:17:11.1520085Z * [new branch] gh/ezyang/3206/base -> origin/gh/ezyang/3206/base 2025-12-04T09:17:11.1521979Z * [new branch] gh/ezyang/3206/head -> origin/gh/ezyang/3206/head 2025-12-04T09:17:11.1523757Z * [new branch] gh/ezyang/3206/orig -> origin/gh/ezyang/3206/orig 
2025-12-04T09:17:11.1526373Z * [new branch] gh/ezyang/3207/base -> origin/gh/ezyang/3207/base 2025-12-04T09:17:11.1528108Z * [new branch] gh/ezyang/3207/head -> origin/gh/ezyang/3207/head 2025-12-04T09:17:11.1530098Z * [new branch] gh/ezyang/3207/orig -> origin/gh/ezyang/3207/orig 2025-12-04T09:17:11.1532737Z * [new branch] gh/ezyang/3208/base -> origin/gh/ezyang/3208/base 2025-12-04T09:17:11.1534648Z * [new branch] gh/ezyang/3208/head -> origin/gh/ezyang/3208/head 2025-12-04T09:17:11.1536463Z * [new branch] gh/ezyang/3208/orig -> origin/gh/ezyang/3208/orig 2025-12-04T09:17:11.1539073Z * [new branch] gh/ezyang/3209/base -> origin/gh/ezyang/3209/base 2025-12-04T09:17:11.1541030Z * [new branch] gh/ezyang/3209/head -> origin/gh/ezyang/3209/head 2025-12-04T09:17:11.1542789Z * [new branch] gh/ezyang/3209/orig -> origin/gh/ezyang/3209/orig 2025-12-04T09:17:11.1546295Z * [new branch] gh/fadara01/3/base -> origin/gh/fadara01/3/base 2025-12-04T09:17:11.1548090Z * [new branch] gh/fadara01/3/head -> origin/gh/fadara01/3/head 2025-12-04T09:17:11.1550037Z * [new branch] gh/fadara01/3/orig -> origin/gh/fadara01/3/orig 2025-12-04T09:17:11.1552656Z * [new branch] gh/fadara01/5/base -> origin/gh/fadara01/5/base 2025-12-04T09:17:11.1554425Z * [new branch] gh/fadara01/5/head -> origin/gh/fadara01/5/head 2025-12-04T09:17:11.1556251Z * [new branch] gh/fadara01/5/orig -> origin/gh/fadara01/5/orig 2025-12-04T09:17:11.1559030Z * [new branch] gh/fadara01/6/base -> origin/gh/fadara01/6/base 2025-12-04T09:17:11.1561007Z * [new branch] gh/fadara01/6/head -> origin/gh/fadara01/6/head 2025-12-04T09:17:11.1562814Z * [new branch] gh/fadara01/6/orig -> origin/gh/fadara01/6/orig 2025-12-04T09:17:11.1565599Z * [new branch] gh/fadara01/7/base -> origin/gh/fadara01/7/base 2025-12-04T09:17:11.1567225Z * [new branch] gh/fadara01/7/head -> origin/gh/fadara01/7/head 2025-12-04T09:17:11.1569055Z * [new branch] gh/fadara01/7/orig -> origin/gh/fadara01/7/orig 2025-12-04T09:17:11.1571623Z * [new branch] 
gh/fadara01/8/base -> origin/gh/fadara01/8/base 2025-12-04T09:17:11.1573345Z * [new branch] gh/fadara01/8/head -> origin/gh/fadara01/8/head 2025-12-04T09:17:11.1575224Z * [new branch] gh/fadara01/8/orig -> origin/gh/fadara01/8/orig 2025-12-04T09:17:11.1577760Z * [new branch] gh/fadara01/9/base -> origin/gh/fadara01/9/base 2025-12-04T09:17:11.1579659Z * [new branch] gh/fadara01/9/head -> origin/gh/fadara01/9/head 2025-12-04T09:17:11.1581575Z * [new branch] gh/fadara01/9/orig -> origin/gh/fadara01/9/orig 2025-12-04T09:17:11.1584729Z * [new branch] gh/fduwjj/182/base -> origin/gh/fduwjj/182/base 2025-12-04T09:17:11.1586564Z * [new branch] gh/fduwjj/182/head -> origin/gh/fduwjj/182/head 2025-12-04T09:17:11.1588465Z * [new branch] gh/fduwjj/182/orig -> origin/gh/fduwjj/182/orig 2025-12-04T09:17:11.1591005Z * [new branch] gh/fduwjj/211/base -> origin/gh/fduwjj/211/base 2025-12-04T09:17:11.1592887Z * [new branch] gh/fduwjj/211/head -> origin/gh/fduwjj/211/head 2025-12-04T09:17:11.1594648Z * [new branch] gh/fduwjj/211/orig -> origin/gh/fduwjj/211/orig 2025-12-04T09:17:11.1597170Z * [new branch] gh/fduwjj/212/base -> origin/gh/fduwjj/212/base 2025-12-04T09:17:11.1598995Z * [new branch] gh/fduwjj/212/head -> origin/gh/fduwjj/212/head 2025-12-04T09:17:11.1601013Z * [new branch] gh/fduwjj/212/orig -> origin/gh/fduwjj/212/orig 2025-12-04T09:17:11.1603772Z * [new branch] gh/fduwjj/213/base -> origin/gh/fduwjj/213/base 2025-12-04T09:17:11.1605722Z * [new branch] gh/fduwjj/213/head -> origin/gh/fduwjj/213/head 2025-12-04T09:17:11.1607673Z * [new branch] gh/fduwjj/213/orig -> origin/gh/fduwjj/213/orig 2025-12-04T09:17:11.1610333Z * [new branch] gh/fduwjj/226/base -> origin/gh/fduwjj/226/base 2025-12-04T09:17:11.1612102Z * [new branch] gh/fduwjj/226/head -> origin/gh/fduwjj/226/head 2025-12-04T09:17:11.1613877Z * [new branch] gh/fduwjj/226/orig -> origin/gh/fduwjj/226/orig 2025-12-04T09:17:11.1616535Z * [new branch] gh/fduwjj/229/base -> origin/gh/fduwjj/229/base 
2025-12-04T09:17:11.1618438Z * [new branch] gh/fduwjj/229/head -> origin/gh/fduwjj/229/head 2025-12-04T09:17:11.1620154Z * [new branch] gh/fduwjj/229/orig -> origin/gh/fduwjj/229/orig 2025-12-04T09:17:11.1622758Z * [new branch] gh/fduwjj/233/base -> origin/gh/fduwjj/233/base 2025-12-04T09:17:11.1624635Z * [new branch] gh/fduwjj/233/head -> origin/gh/fduwjj/233/head 2025-12-04T09:17:11.1626446Z * [new branch] gh/fduwjj/233/orig -> origin/gh/fduwjj/233/orig 2025-12-04T09:17:11.1629063Z * [new branch] gh/fduwjj/234/base -> origin/gh/fduwjj/234/base 2025-12-04T09:17:11.1630860Z * [new branch] gh/fduwjj/234/head -> origin/gh/fduwjj/234/head 2025-12-04T09:17:11.1632770Z * [new branch] gh/fduwjj/234/orig -> origin/gh/fduwjj/234/orig 2025-12-04T09:17:11.1635268Z * [new branch] gh/fduwjj/235/base -> origin/gh/fduwjj/235/base 2025-12-04T09:17:11.1637185Z * [new branch] gh/fduwjj/235/head -> origin/gh/fduwjj/235/head 2025-12-04T09:17:11.1639029Z * [new branch] gh/fduwjj/235/orig -> origin/gh/fduwjj/235/orig 2025-12-04T09:17:11.1641724Z * [new branch] gh/fduwjj/236/base -> origin/gh/fduwjj/236/base 2025-12-04T09:17:11.1643442Z * [new branch] gh/fduwjj/236/head -> origin/gh/fduwjj/236/head 2025-12-04T09:17:11.1645288Z * [new branch] gh/fduwjj/236/orig -> origin/gh/fduwjj/236/orig 2025-12-04T09:17:11.1647821Z * [new branch] gh/fduwjj/237/base -> origin/gh/fduwjj/237/base 2025-12-04T09:17:11.1649485Z * [new branch] gh/fduwjj/237/head -> origin/gh/fduwjj/237/head 2025-12-04T09:17:11.1651361Z * [new branch] gh/fduwjj/237/orig -> origin/gh/fduwjj/237/orig 2025-12-04T09:17:11.1653803Z * [new branch] gh/fduwjj/238/base -> origin/gh/fduwjj/238/base 2025-12-04T09:17:11.1655617Z * [new branch] gh/fduwjj/238/head -> origin/gh/fduwjj/238/head 2025-12-04T09:17:11.1657527Z * [new branch] gh/fduwjj/238/orig -> origin/gh/fduwjj/238/orig 2025-12-04T09:17:11.1660122Z * [new branch] gh/fduwjj/239/base -> origin/gh/fduwjj/239/base 2025-12-04T09:17:11.1662006Z * [new branch] gh/fduwjj/239/head -> 
origin/gh/fduwjj/239/head 2025-12-04T09:17:11.1663812Z * [new branch] gh/fduwjj/239/orig -> origin/gh/fduwjj/239/orig 2025-12-04T09:17:11.1666905Z * [new branch] gh/fegin/332/base -> origin/gh/fegin/332/base 2025-12-04T09:17:11.1668808Z * [new branch] gh/fegin/332/head -> origin/gh/fegin/332/head 2025-12-04T09:17:11.1670649Z * [new branch] gh/fegin/332/orig -> origin/gh/fegin/332/orig 2025-12-04T09:17:11.1673238Z * [new branch] gh/fegin/333/base -> origin/gh/fegin/333/base 2025-12-04T09:17:11.1675079Z * [new branch] gh/fegin/333/head -> origin/gh/fegin/333/head 2025-12-04T09:17:11.1676927Z * [new branch] gh/fegin/333/orig -> origin/gh/fegin/333/orig 2025-12-04T09:17:11.1679520Z * [new branch] gh/fegin/334/base -> origin/gh/fegin/334/base 2025-12-04T09:17:11.1681580Z * [new branch] gh/fegin/334/head -> origin/gh/fegin/334/head 2025-12-04T09:17:11.1683561Z * [new branch] gh/fegin/334/orig -> origin/gh/fegin/334/orig 2025-12-04T09:17:11.1686033Z * [new branch] gh/fegin/335/base -> origin/gh/fegin/335/base 2025-12-04T09:17:11.1687854Z * [new branch] gh/fegin/335/head -> origin/gh/fegin/335/head 2025-12-04T09:17:11.1689771Z * [new branch] gh/fegin/335/orig -> origin/gh/fegin/335/orig 2025-12-04T09:17:11.1692787Z * [new branch] gh/fffrog/160/base -> origin/gh/fffrog/160/base 2025-12-04T09:17:11.1694616Z * [new branch] gh/fffrog/160/head -> origin/gh/fffrog/160/head 2025-12-04T09:17:11.1697164Z * [new branch] gh/fffrog/177/base -> origin/gh/fffrog/177/base 2025-12-04T09:17:11.1698983Z * [new branch] gh/fffrog/177/head -> origin/gh/fffrog/177/head 2025-12-04T09:17:11.1700925Z * [new branch] gh/fffrog/177/orig -> origin/gh/fffrog/177/orig 2025-12-04T09:17:11.1711335Z * [new branch] gh/fffrog/178/base -> origin/gh/fffrog/178/base 2025-12-04T09:17:11.1711643Z * [new branch] gh/fffrog/178/head -> origin/gh/fffrog/178/head 2025-12-04T09:17:11.1711857Z * [new branch] gh/fffrog/178/orig -> origin/gh/fffrog/178/orig 2025-12-04T09:17:11.1712063Z * [new branch] gh/fffrog/181/base -> 
origin/gh/fffrog/181/base 2025-12-04T09:17:11.1712260Z * [new branch] gh/fffrog/181/head -> origin/gh/fffrog/181/head 2025-12-04T09:17:11.1714199Z * [new branch] gh/fffrog/181/orig -> origin/gh/fffrog/181/orig 2025-12-04T09:17:11.1717348Z * [new branch] gh/fffrog/183/base -> origin/gh/fffrog/183/base 2025-12-04T09:17:11.1719304Z * [new branch] gh/fffrog/183/head -> origin/gh/fffrog/183/head 2025-12-04T09:17:11.1721227Z * [new branch] gh/fffrog/183/orig -> origin/gh/fffrog/183/orig 2025-12-04T09:17:11.1724297Z * [new branch] gh/fxdawnn/10/base -> origin/gh/fxdawnn/10/base 2025-12-04T09:17:11.1726800Z * [new branch] gh/fxdawnn/10/head -> origin/gh/fxdawnn/10/head 2025-12-04T09:17:11.1727934Z * [new branch] gh/fxdawnn/10/orig -> origin/gh/fxdawnn/10/orig 2025-12-04T09:17:11.1731457Z * [new branch] gh/fxdawnn/11/base -> origin/gh/fxdawnn/11/base 2025-12-04T09:17:11.1732539Z * [new branch] gh/fxdawnn/11/head -> origin/gh/fxdawnn/11/head 2025-12-04T09:17:11.1734957Z * [new branch] gh/fxdawnn/11/orig -> origin/gh/fxdawnn/11/orig 2025-12-04T09:17:11.1737345Z * [new branch] gh/fxdawnn/12/base -> origin/gh/fxdawnn/12/base 2025-12-04T09:17:11.1739239Z * [new branch] gh/fxdawnn/12/head -> origin/gh/fxdawnn/12/head 2025-12-04T09:17:11.1740951Z * [new branch] gh/fxdawnn/12/orig -> origin/gh/fxdawnn/12/orig 2025-12-04T09:17:11.1743538Z * [new branch] gh/fxdawnn/13/base -> origin/gh/fxdawnn/13/base 2025-12-04T09:17:11.1745924Z * [new branch] gh/fxdawnn/13/head -> origin/gh/fxdawnn/13/head 2025-12-04T09:17:11.1747870Z * [new branch] gh/fxdawnn/13/orig -> origin/gh/fxdawnn/13/orig 2025-12-04T09:17:11.1750410Z * [new branch] gh/fxdawnn/14/base -> origin/gh/fxdawnn/14/base 2025-12-04T09:17:11.1752200Z * [new branch] gh/fxdawnn/14/head -> origin/gh/fxdawnn/14/head 2025-12-04T09:17:11.1753931Z * [new branch] gh/fxdawnn/14/orig -> origin/gh/fxdawnn/14/orig 2025-12-04T09:17:11.1756583Z * [new branch] gh/fxdawnn/15/base -> origin/gh/fxdawnn/15/base 2025-12-04T09:17:11.1758415Z * [new 
branch] gh/fxdawnn/15/head -> origin/gh/fxdawnn/15/head 2025-12-04T09:17:11.1760368Z * [new branch] gh/fxdawnn/15/orig -> origin/gh/fxdawnn/15/orig 2025-12-04T09:17:11.1762970Z * [new branch] gh/fxdawnn/6/base -> origin/gh/fxdawnn/6/base 2025-12-04T09:17:11.1764688Z * [new branch] gh/fxdawnn/6/head -> origin/gh/fxdawnn/6/head 2025-12-04T09:17:11.1766546Z * [new branch] gh/fxdawnn/6/orig -> origin/gh/fxdawnn/6/orig 2025-12-04T09:17:11.1769239Z * [new branch] gh/fxdawnn/7/base -> origin/gh/fxdawnn/7/base 2025-12-04T09:17:11.1771159Z * [new branch] gh/fxdawnn/7/head -> origin/gh/fxdawnn/7/head 2025-12-04T09:17:11.1772846Z * [new branch] gh/fxdawnn/7/orig -> origin/gh/fxdawnn/7/orig 2025-12-04T09:17:11.1775587Z * [new branch] gh/fxdawnn/9/base -> origin/gh/fxdawnn/9/base 2025-12-04T09:17:11.1777331Z * [new branch] gh/fxdawnn/9/head -> origin/gh/fxdawnn/9/head 2025-12-04T09:17:11.1779061Z * [new branch] gh/fxdawnn/9/orig -> origin/gh/fxdawnn/9/orig 2025-12-04T09:17:11.1782230Z * [new branch] gh/galv/1/base -> origin/gh/galv/1/base 2025-12-04T09:17:11.1784081Z * [new branch] gh/galv/1/head -> origin/gh/galv/1/head 2025-12-04T09:17:11.1786340Z * [new branch] gh/galv/1/orig -> origin/gh/galv/1/orig 2025-12-04T09:17:11.1788267Z * [new branch] gh/galv/2/base -> origin/gh/galv/2/base 2025-12-04T09:17:11.1790088Z * [new branch] gh/galv/2/head -> origin/gh/galv/2/head 2025-12-04T09:17:11.1792039Z * [new branch] gh/galv/2/orig -> origin/gh/galv/2/orig 2025-12-04T09:17:11.1794615Z * [new branch] gh/galv/3/base -> origin/gh/galv/3/base 2025-12-04T09:17:11.1796377Z * [new branch] gh/galv/3/head -> origin/gh/galv/3/head 2025-12-04T09:17:11.1798412Z * [new branch] gh/galv/3/orig -> origin/gh/galv/3/orig 2025-12-04T09:17:11.1801888Z * [new branch] gh/guangyey/134/base -> origin/gh/guangyey/134/base 2025-12-04T09:17:11.1803812Z * [new branch] gh/guangyey/134/head -> origin/gh/guangyey/134/head 2025-12-04T09:17:11.1805713Z * [new branch] gh/guangyey/134/orig -> 
origin/gh/guangyey/134/orig 2025-12-04T09:17:11.1808547Z * [new branch] gh/guangyey/163/base -> origin/gh/guangyey/163/base 2025-12-04T09:17:11.1810115Z * [new branch] gh/guangyey/163/head -> origin/gh/guangyey/163/head 2025-12-04T09:17:11.1812465Z * [new branch] gh/guangyey/163/orig -> origin/gh/guangyey/163/orig 2025-12-04T09:17:11.1815234Z * [new branch] gh/guangyey/168/base -> origin/gh/guangyey/168/base 2025-12-04T09:17:11.1816817Z * [new branch] gh/guangyey/168/head -> origin/gh/guangyey/168/head 2025-12-04T09:17:11.1818717Z * [new branch] gh/guangyey/168/orig -> origin/gh/guangyey/168/orig 2025-12-04T09:17:11.1821210Z * [new branch] gh/guangyey/169/base -> origin/gh/guangyey/169/base 2025-12-04T09:17:11.1823308Z * [new branch] gh/guangyey/169/head -> origin/gh/guangyey/169/head 2025-12-04T09:17:11.1824971Z * [new branch] gh/guangyey/169/orig -> origin/gh/guangyey/169/orig 2025-12-04T09:17:11.1827449Z * [new branch] gh/guangyey/170/base -> origin/gh/guangyey/170/base 2025-12-04T09:17:11.1829382Z * [new branch] gh/guangyey/170/head -> origin/gh/guangyey/170/head 2025-12-04T09:17:11.1831269Z * [new branch] gh/guangyey/170/orig -> origin/gh/guangyey/170/orig 2025-12-04T09:17:11.1833761Z * [new branch] gh/guangyey/171/base -> origin/gh/guangyey/171/base 2025-12-04T09:17:11.1835534Z * [new branch] gh/guangyey/171/head -> origin/gh/guangyey/171/head 2025-12-04T09:17:11.1837356Z * [new branch] gh/guangyey/171/orig -> origin/gh/guangyey/171/orig 2025-12-04T09:17:11.1840101Z * [new branch] gh/guangyey/178/base -> origin/gh/guangyey/178/base 2025-12-04T09:17:11.1842010Z * [new branch] gh/guangyey/178/head -> origin/gh/guangyey/178/head 2025-12-04T09:17:11.1843799Z * [new branch] gh/guangyey/178/orig -> origin/gh/guangyey/178/orig 2025-12-04T09:17:11.1846241Z * [new branch] gh/guangyey/182/base -> origin/gh/guangyey/182/base 2025-12-04T09:17:11.1848610Z * [new branch] gh/guangyey/182/head -> origin/gh/guangyey/182/head 2025-12-04T09:17:11.1850430Z * [new branch] 
gh/guangyey/182/orig -> origin/gh/guangyey/182/orig 2025-12-04T09:17:11.1852857Z * [new branch] gh/guangyey/183/base -> origin/gh/guangyey/183/base 2025-12-04T09:17:11.1854668Z * [new branch] gh/guangyey/183/head -> origin/gh/guangyey/183/head 2025-12-04T09:17:11.1856635Z * [new branch] gh/guangyey/183/orig -> origin/gh/guangyey/183/orig 2025-12-04T09:17:11.1859174Z * [new branch] gh/guangyey/185/base -> origin/gh/guangyey/185/base 2025-12-04T09:17:11.1861108Z * [new branch] gh/guangyey/185/head -> origin/gh/guangyey/185/head 2025-12-04T09:17:11.1862908Z * [new branch] gh/guangyey/185/orig -> origin/gh/guangyey/185/orig 2025-12-04T09:17:11.1865522Z * [new branch] gh/guangyey/186/base -> origin/gh/guangyey/186/base 2025-12-04T09:17:11.1867318Z * [new branch] gh/guangyey/186/head -> origin/gh/guangyey/186/head 2025-12-04T09:17:11.1869260Z * [new branch] gh/guangyey/186/orig -> origin/gh/guangyey/186/orig 2025-12-04T09:17:11.1871645Z * [new branch] gh/guangyey/187/base -> origin/gh/guangyey/187/base 2025-12-04T09:17:11.1873456Z * [new branch] gh/guangyey/187/head -> origin/gh/guangyey/187/head 2025-12-04T09:17:11.1875310Z * [new branch] gh/guangyey/187/orig -> origin/gh/guangyey/187/orig 2025-12-04T09:17:11.1877856Z * [new branch] gh/guangyey/188/base -> origin/gh/guangyey/188/base 2025-12-04T09:17:11.1879753Z * [new branch] gh/guangyey/188/head -> origin/gh/guangyey/188/head 2025-12-04T09:17:11.1881795Z * [new branch] gh/guangyey/188/orig -> origin/gh/guangyey/188/orig 2025-12-04T09:17:11.1884264Z * [new branch] gh/guangyey/190/base -> origin/gh/guangyey/190/base 2025-12-04T09:17:11.1886089Z * [new branch] gh/guangyey/190/head -> origin/gh/guangyey/190/head 2025-12-04T09:17:11.1887921Z * [new branch] gh/guangyey/190/orig -> origin/gh/guangyey/190/orig 2025-12-04T09:17:11.1890616Z * [new branch] gh/guangyey/208/base -> origin/gh/guangyey/208/base 2025-12-04T09:17:11.1892476Z * [new branch] gh/guangyey/208/head -> origin/gh/guangyey/208/head 
2025-12-04T09:17:11.1894376Z * [new branch] gh/guangyey/208/orig -> origin/gh/guangyey/208/orig 2025-12-04T09:17:11.1896851Z * [new branch] gh/guangyey/228/base -> origin/gh/guangyey/228/base 2025-12-04T09:17:11.1898632Z * [new branch] gh/guangyey/228/head -> origin/gh/guangyey/228/head 2025-12-04T09:17:11.1900848Z * [new branch] gh/guangyey/228/orig -> origin/gh/guangyey/228/orig 2025-12-04T09:17:11.1905415Z * [new branch] gh/guangyey/230/base -> origin/gh/guangyey/230/base 2025-12-04T09:17:11.1907251Z * [new branch] gh/guangyey/230/head -> origin/gh/guangyey/230/head 2025-12-04T09:17:11.1909789Z * [new branch] gh/guangyey/230/orig -> origin/gh/guangyey/230/orig 2025-12-04T09:17:11.1912382Z * [new branch] gh/guangyey/231/base -> origin/gh/guangyey/231/base 2025-12-04T09:17:11.1914147Z * [new branch] gh/guangyey/231/head -> origin/gh/guangyey/231/head 2025-12-04T09:17:11.1915971Z * [new branch] gh/guangyey/231/orig -> origin/gh/guangyey/231/orig 2025-12-04T09:17:11.1918623Z * [new branch] gh/guangyey/232/base -> origin/gh/guangyey/232/base 2025-12-04T09:17:11.1920613Z * [new branch] gh/guangyey/232/head -> origin/gh/guangyey/232/head 2025-12-04T09:17:11.1922424Z * [new branch] gh/guangyey/232/orig -> origin/gh/guangyey/232/orig 2025-12-04T09:17:11.1925028Z * [new branch] gh/guangyey/233/base -> origin/gh/guangyey/233/base 2025-12-04T09:17:11.1926866Z * [new branch] gh/guangyey/233/head -> origin/gh/guangyey/233/head 2025-12-04T09:17:11.1928689Z * [new branch] gh/guangyey/233/orig -> origin/gh/guangyey/233/orig 2025-12-04T09:17:11.1931251Z * [new branch] gh/guangyey/234/base -> origin/gh/guangyey/234/base 2025-12-04T09:17:11.1933052Z * [new branch] gh/guangyey/234/head -> origin/gh/guangyey/234/head 2025-12-04T09:17:11.1935001Z * [new branch] gh/guangyey/234/orig -> origin/gh/guangyey/234/orig 2025-12-04T09:17:11.1937549Z * [new branch] gh/guangyey/235/base -> origin/gh/guangyey/235/base 2025-12-04T09:17:11.1939484Z * [new branch] gh/guangyey/235/head -> 
origin/gh/guangyey/235/head 2025-12-04T09:17:11.1941258Z * [new branch] gh/guangyey/235/orig -> origin/gh/guangyey/235/orig 2025-12-04T09:17:11.1943858Z * [new branch] gh/guangyey/236/base -> origin/gh/guangyey/236/base 2025-12-04T09:17:11.1945889Z * [new branch] gh/guangyey/236/head -> origin/gh/guangyey/236/head 2025-12-04T09:17:11.1947586Z * [new branch] gh/guangyey/236/orig -> origin/gh/guangyey/236/orig 2025-12-04T09:17:11.1950216Z * [new branch] gh/guangyey/237/base -> origin/gh/guangyey/237/base 2025-12-04T09:17:11.1952062Z * [new branch] gh/guangyey/237/head -> origin/gh/guangyey/237/head 2025-12-04T09:17:11.1953910Z * [new branch] gh/guangyey/237/orig -> origin/gh/guangyey/237/orig 2025-12-04T09:17:11.1956488Z * [new branch] gh/guangyey/238/base -> origin/gh/guangyey/238/base 2025-12-04T09:17:11.1958378Z * [new branch] gh/guangyey/238/head -> origin/gh/guangyey/238/head 2025-12-04T09:17:11.1961244Z * [new branch] gh/guangyey/239/base -> origin/gh/guangyey/239/base 2025-12-04T09:17:11.1963082Z * [new branch] gh/guangyey/239/head -> origin/gh/guangyey/239/head 2025-12-04T09:17:11.1964868Z * [new branch] gh/guangyey/239/orig -> origin/gh/guangyey/239/orig 2025-12-04T09:17:11.1967445Z * [new branch] gh/guangyey/240/base -> origin/gh/guangyey/240/base 2025-12-04T09:17:11.1969327Z * [new branch] gh/guangyey/240/head -> origin/gh/guangyey/240/head 2025-12-04T09:17:11.1971217Z * [new branch] gh/guangyey/240/orig -> origin/gh/guangyey/240/orig 2025-12-04T09:17:11.1973748Z * [new branch] gh/guangyey/241/base -> origin/gh/guangyey/241/base 2025-12-04T09:17:11.1975537Z * [new branch] gh/guangyey/241/head -> origin/gh/guangyey/241/head 2025-12-04T09:17:11.1977441Z * [new branch] gh/guangyey/241/orig -> origin/gh/guangyey/241/orig 2025-12-04T09:17:11.1980049Z * [new branch] gh/guangyey/242/base -> origin/gh/guangyey/242/base 2025-12-04T09:17:11.1981867Z * [new branch] gh/guangyey/242/head -> origin/gh/guangyey/242/head 2025-12-04T09:17:11.1983675Z * [new branch] 
gh/guangyey/242/orig -> origin/gh/guangyey/242/orig 2025-12-04T09:17:11.1986350Z * [new branch] gh/guangyey/243/base -> origin/gh/guangyey/243/base 2025-12-04T09:17:11.1988209Z * [new branch] gh/guangyey/243/head -> origin/gh/guangyey/243/head 2025-12-04T09:17:11.1990048Z * [new branch] gh/guangyey/243/orig -> origin/gh/guangyey/243/orig 2025-12-04T09:17:11.1992778Z * [new branch] gh/guangyey/244/base -> origin/gh/guangyey/244/base 2025-12-04T09:17:11.1994975Z * [new branch] gh/guangyey/244/head -> origin/gh/guangyey/244/head 2025-12-04T09:17:11.1996409Z * [new branch] gh/guangyey/244/orig -> origin/gh/guangyey/244/orig 2025-12-04T09:17:11.1999023Z * [new branch] gh/guangyey/245/base -> origin/gh/guangyey/245/base 2025-12-04T09:17:11.2001319Z * [new branch] gh/guangyey/245/head -> origin/gh/guangyey/245/head 2025-12-04T09:17:11.2003242Z * [new branch] gh/guangyey/245/orig -> origin/gh/guangyey/245/orig 2025-12-04T09:17:11.2005834Z * [new branch] gh/guangyey/246/base -> origin/gh/guangyey/246/base 2025-12-04T09:17:11.2007697Z * [new branch] gh/guangyey/246/head -> origin/gh/guangyey/246/head 2025-12-04T09:17:11.2009539Z * [new branch] gh/guangyey/246/orig -> origin/gh/guangyey/246/orig 2025-12-04T09:17:11.2012254Z * [new branch] gh/guangyey/247/base -> origin/gh/guangyey/247/base 2025-12-04T09:17:11.2014150Z * [new branch] gh/guangyey/247/head -> origin/gh/guangyey/247/head 2025-12-04T09:17:11.2015979Z * [new branch] gh/guangyey/247/orig -> origin/gh/guangyey/247/orig 2025-12-04T09:17:11.2018591Z * [new branch] gh/guangyey/248/base -> origin/gh/guangyey/248/base 2025-12-04T09:17:11.2020612Z * [new branch] gh/guangyey/248/head -> origin/gh/guangyey/248/head 2025-12-04T09:17:11.2022354Z * [new branch] gh/guangyey/248/orig -> origin/gh/guangyey/248/orig 2025-12-04T09:17:11.2024884Z * [new branch] gh/guangyey/249/base -> origin/gh/guangyey/249/base 2025-12-04T09:17:11.2026858Z * [new branch] gh/guangyey/249/head -> origin/gh/guangyey/249/head 
2025-12-04T09:17:11.2028789Z * [new branch] gh/guangyey/249/orig -> origin/gh/guangyey/249/orig 2025-12-04T09:17:11.2031351Z * [new branch] gh/guangyey/250/base -> origin/gh/guangyey/250/base 2025-12-04T09:17:11.2033178Z * [new branch] gh/guangyey/250/head -> origin/gh/guangyey/250/head 2025-12-04T09:17:11.2035015Z * [new branch] gh/guangyey/250/orig -> origin/gh/guangyey/250/orig 2025-12-04T09:17:11.2037598Z * [new branch] gh/guangyey/251/base -> origin/gh/guangyey/251/base 2025-12-04T09:17:11.2039550Z * [new branch] gh/guangyey/251/head -> origin/gh/guangyey/251/head 2025-12-04T09:17:11.2041756Z * [new branch] gh/guangyey/251/orig -> origin/gh/guangyey/251/orig 2025-12-04T09:17:11.2044156Z * [new branch] gh/guangyey/252/base -> origin/gh/guangyey/252/base 2025-12-04T09:17:11.2045889Z * [new branch] gh/guangyey/252/head -> origin/gh/guangyey/252/head 2025-12-04T09:17:11.2048007Z * [new branch] gh/guangyey/252/orig -> origin/gh/guangyey/252/orig 2025-12-04T09:17:11.2050593Z * [new branch] gh/guangyey/253/base -> origin/gh/guangyey/253/base 2025-12-04T09:17:11.2052400Z * [new branch] gh/guangyey/253/head -> origin/gh/guangyey/253/head 2025-12-04T09:17:11.2054201Z * [new branch] gh/guangyey/253/orig -> origin/gh/guangyey/253/orig 2025-12-04T09:17:11.2056758Z * [new branch] gh/guangyey/254/base -> origin/gh/guangyey/254/base 2025-12-04T09:17:11.2058721Z * [new branch] gh/guangyey/254/head -> origin/gh/guangyey/254/head 2025-12-04T09:17:11.2060444Z * [new branch] gh/guangyey/254/orig -> origin/gh/guangyey/254/orig 2025-12-04T09:17:11.2063115Z * [new branch] gh/guangyey/255/base -> origin/gh/guangyey/255/base 2025-12-04T09:17:11.2064954Z * [new branch] gh/guangyey/255/head -> origin/gh/guangyey/255/head 2025-12-04T09:17:11.2067270Z * [new branch] gh/guangyey/255/orig -> origin/gh/guangyey/255/orig 2025-12-04T09:17:11.2070121Z * [new branch] gh/guilhermeleobas/107/base -> origin/gh/guilhermeleobas/107/base 2025-12-04T09:17:11.2072227Z * [new branch] 
gh/guilhermeleobas/107/head -> origin/gh/guilhermeleobas/107/head 2025-12-04T09:17:11.2073853Z * [new branch] gh/guilhermeleobas/107/orig -> origin/gh/guilhermeleobas/107/orig 2025-12-04T09:17:11.2076410Z * [new branch] gh/guilhermeleobas/108/base -> origin/gh/guilhermeleobas/108/base 2025-12-04T09:17:11.2078548Z * [new branch] gh/guilhermeleobas/108/head -> origin/gh/guilhermeleobas/108/head 2025-12-04T09:17:11.2080538Z * [new branch] gh/guilhermeleobas/108/orig -> origin/gh/guilhermeleobas/108/orig 2025-12-04T09:17:11.2083080Z * [new branch] gh/guilhermeleobas/150/base -> origin/gh/guilhermeleobas/150/base 2025-12-04T09:17:11.2085924Z * [new branch] gh/guilhermeleobas/150/head -> origin/gh/guilhermeleobas/150/head 2025-12-04T09:17:11.2088358Z * [new branch] gh/guilhermeleobas/150/orig -> origin/gh/guilhermeleobas/150/orig 2025-12-04T09:17:11.2091860Z * [new branch] gh/guilhermeleobas/168/base -> origin/gh/guilhermeleobas/168/base 2025-12-04T09:17:11.2092653Z * [new branch] gh/guilhermeleobas/168/head -> origin/gh/guilhermeleobas/168/head 2025-12-04T09:17:11.2095128Z * [new branch] gh/guilhermeleobas/168/orig -> origin/gh/guilhermeleobas/168/orig 2025-12-04T09:17:11.2097781Z * [new branch] gh/guilhermeleobas/169/base -> origin/gh/guilhermeleobas/169/base 2025-12-04T09:17:11.2099713Z * [new branch] gh/guilhermeleobas/169/head -> origin/gh/guilhermeleobas/169/head 2025-12-04T09:17:11.2101636Z * [new branch] gh/guilhermeleobas/169/orig -> origin/gh/guilhermeleobas/169/orig 2025-12-04T09:17:11.2104360Z * [new branch] gh/guilhermeleobas/170/base -> origin/gh/guilhermeleobas/170/base 2025-12-04T09:17:11.2106322Z * [new branch] gh/guilhermeleobas/170/head -> origin/gh/guilhermeleobas/170/head 2025-12-04T09:17:11.2108044Z * [new branch] gh/guilhermeleobas/170/orig -> origin/gh/guilhermeleobas/170/orig 2025-12-04T09:17:11.2111041Z * [new branch] gh/guilhermeleobas/171/base -> origin/gh/guilhermeleobas/171/base 2025-12-04T09:17:11.2112870Z * [new branch] 
gh/guilhermeleobas/171/head -> origin/gh/guilhermeleobas/171/head 2025-12-04T09:17:11.2115274Z * [new branch] gh/guilhermeleobas/171/orig -> origin/gh/guilhermeleobas/171/orig 2025-12-04T09:17:11.2117902Z * [new branch] gh/guilhermeleobas/173/base -> origin/gh/guilhermeleobas/173/base 2025-12-04T09:17:11.2119733Z * [new branch] gh/guilhermeleobas/173/head -> origin/gh/guilhermeleobas/173/head 2025-12-04T09:17:11.2121778Z * [new branch] gh/guilhermeleobas/173/orig -> origin/gh/guilhermeleobas/173/orig 2025-12-04T09:17:11.2124194Z * [new branch] gh/guilhermeleobas/193/base -> origin/gh/guilhermeleobas/193/base 2025-12-04T09:17:11.2126124Z * [new branch] gh/guilhermeleobas/193/head -> origin/gh/guilhermeleobas/193/head 2025-12-04T09:17:11.2128115Z * [new branch] gh/guilhermeleobas/193/orig -> origin/gh/guilhermeleobas/193/orig 2025-12-04T09:17:11.2130482Z * [new branch] gh/guilhermeleobas/204/base -> origin/gh/guilhermeleobas/204/base 2025-12-04T09:17:11.2132436Z * [new branch] gh/guilhermeleobas/204/head -> origin/gh/guilhermeleobas/204/head 2025-12-04T09:17:11.2134231Z * [new branch] gh/guilhermeleobas/204/orig -> origin/gh/guilhermeleobas/204/orig 2025-12-04T09:17:11.2136750Z * [new branch] gh/guilhermeleobas/211/base -> origin/gh/guilhermeleobas/211/base 2025-12-04T09:17:11.2138508Z * [new branch] gh/guilhermeleobas/211/head -> origin/gh/guilhermeleobas/211/head 2025-12-04T09:17:11.2140480Z * [new branch] gh/guilhermeleobas/211/orig -> origin/gh/guilhermeleobas/211/orig 2025-12-04T09:17:11.2143050Z * [new branch] gh/guilhermeleobas/226/base -> origin/gh/guilhermeleobas/226/base 2025-12-04T09:17:11.2145078Z * [new branch] gh/guilhermeleobas/226/head -> origin/gh/guilhermeleobas/226/head 2025-12-04T09:17:11.2146863Z * [new branch] gh/guilhermeleobas/226/orig -> origin/gh/guilhermeleobas/226/orig 2025-12-04T09:17:11.2149305Z * [new branch] gh/guilhermeleobas/236/base -> origin/gh/guilhermeleobas/236/base 2025-12-04T09:17:11.2151270Z * [new branch] 
gh/guilhermeleobas/236/head -> origin/gh/guilhermeleobas/236/head 2025-12-04T09:17:11.2152974Z * [new branch] gh/guilhermeleobas/236/orig -> origin/gh/guilhermeleobas/236/orig 2025-12-04T09:17:11.2155422Z * [new branch] gh/guilhermeleobas/247/base -> origin/gh/guilhermeleobas/247/base 2025-12-04T09:17:11.2157265Z * [new branch] gh/guilhermeleobas/247/head -> origin/gh/guilhermeleobas/247/head 2025-12-04T09:17:11.2159157Z * [new branch] gh/guilhermeleobas/247/orig -> origin/gh/guilhermeleobas/247/orig 2025-12-04T09:17:11.2161972Z * [new branch] gh/guilhermeleobas/248/base -> origin/gh/guilhermeleobas/248/base 2025-12-04T09:17:11.2163794Z * [new branch] gh/guilhermeleobas/248/head -> origin/gh/guilhermeleobas/248/head 2025-12-04T09:17:11.2165774Z * [new branch] gh/guilhermeleobas/248/orig -> origin/gh/guilhermeleobas/248/orig 2025-12-04T09:17:11.2168425Z * [new branch] gh/guilhermeleobas/250/base -> origin/gh/guilhermeleobas/250/base 2025-12-04T09:17:11.2170275Z * [new branch] gh/guilhermeleobas/250/head -> origin/gh/guilhermeleobas/250/head 2025-12-04T09:17:11.2172193Z * [new branch] gh/guilhermeleobas/250/orig -> origin/gh/guilhermeleobas/250/orig 2025-12-04T09:17:11.2175010Z * [new branch] gh/guilhermeleobas/253/base -> origin/gh/guilhermeleobas/253/base 2025-12-04T09:17:11.2177006Z * [new branch] gh/guilhermeleobas/253/head -> origin/gh/guilhermeleobas/253/head 2025-12-04T09:17:11.2179084Z * [new branch] gh/guilhermeleobas/253/orig -> origin/gh/guilhermeleobas/253/orig 2025-12-04T09:17:11.2181437Z * [new branch] gh/guilhermeleobas/254/base -> origin/gh/guilhermeleobas/254/base 2025-12-04T09:17:11.2183406Z * [new branch] gh/guilhermeleobas/254/head -> origin/gh/guilhermeleobas/254/head 2025-12-04T09:17:11.2185326Z * [new branch] gh/guilhermeleobas/254/orig -> origin/gh/guilhermeleobas/254/orig 2025-12-04T09:17:11.2187687Z * [new branch] gh/guilhermeleobas/255/base -> origin/gh/guilhermeleobas/255/base 2025-12-04T09:17:11.2189666Z * [new branch] 
gh/guilhermeleobas/255/head -> origin/gh/guilhermeleobas/255/head 2025-12-04T09:17:11.2191865Z * [new branch] gh/guilhermeleobas/255/orig -> origin/gh/guilhermeleobas/255/orig 2025-12-04T09:17:11.2194115Z * [new branch] gh/guilhermeleobas/256/base -> origin/gh/guilhermeleobas/256/base 2025-12-04T09:17:11.2196389Z * [new branch] gh/guilhermeleobas/256/head -> origin/gh/guilhermeleobas/256/head 2025-12-04T09:17:11.2197838Z * [new branch] gh/guilhermeleobas/256/orig -> origin/gh/guilhermeleobas/256/orig 2025-12-04T09:17:11.2200806Z * [new branch] gh/guilhermeleobas/257/base -> origin/gh/guilhermeleobas/257/base 2025-12-04T09:17:11.2202795Z * [new branch] gh/guilhermeleobas/257/head -> origin/gh/guilhermeleobas/257/head 2025-12-04T09:17:11.2204798Z * [new branch] gh/guilhermeleobas/257/orig -> origin/gh/guilhermeleobas/257/orig 2025-12-04T09:17:11.2207401Z * [new branch] gh/guilhermeleobas/258/base -> origin/gh/guilhermeleobas/258/base 2025-12-04T09:17:11.2209176Z * [new branch] gh/guilhermeleobas/258/head -> origin/gh/guilhermeleobas/258/head 2025-12-04T09:17:11.2211001Z * [new branch] gh/guilhermeleobas/258/orig -> origin/gh/guilhermeleobas/258/orig 2025-12-04T09:17:11.2213559Z * [new branch] gh/guilhermeleobas/259/base -> origin/gh/guilhermeleobas/259/base 2025-12-04T09:17:11.2215434Z * [new branch] gh/guilhermeleobas/259/head -> origin/gh/guilhermeleobas/259/head 2025-12-04T09:17:11.2217355Z * [new branch] gh/guilhermeleobas/259/orig -> origin/gh/guilhermeleobas/259/orig 2025-12-04T09:17:11.2219952Z * [new branch] gh/guilhermeleobas/260/base -> origin/gh/guilhermeleobas/260/base 2025-12-04T09:17:11.2224036Z * [new branch] gh/guilhermeleobas/260/head -> origin/gh/guilhermeleobas/260/head 2025-12-04T09:17:11.2226049Z * [new branch] gh/guilhermeleobas/260/orig -> origin/gh/guilhermeleobas/260/orig 2025-12-04T09:17:11.2226561Z * [new branch] gh/guilhermeleobas/261/base -> origin/gh/guilhermeleobas/261/base 2025-12-04T09:17:11.2228037Z * [new branch] 
gh/guilhermeleobas/261/head -> origin/gh/guilhermeleobas/261/head 2025-12-04T09:17:11.2229719Z * [new branch] gh/guilhermeleobas/261/orig -> origin/gh/guilhermeleobas/261/orig 2025-12-04T09:17:11.2232274Z * [new branch] gh/guilhermeleobas/262/base -> origin/gh/guilhermeleobas/262/base 2025-12-04T09:17:11.2234420Z * [new branch] gh/guilhermeleobas/262/head -> origin/gh/guilhermeleobas/262/head 2025-12-04T09:17:11.2235935Z * [new branch] gh/guilhermeleobas/262/orig -> origin/gh/guilhermeleobas/262/orig 2025-12-04T09:17:11.2238792Z * [new branch] gh/guilhermeleobas/263/base -> origin/gh/guilhermeleobas/263/base 2025-12-04T09:17:11.2241029Z * [new branch] gh/guilhermeleobas/263/head -> origin/gh/guilhermeleobas/263/head 2025-12-04T09:17:11.2242562Z * [new branch] gh/guilhermeleobas/263/orig -> origin/gh/guilhermeleobas/263/orig 2025-12-04T09:17:11.2245219Z * [new branch] gh/guilhermeleobas/264/base -> origin/gh/guilhermeleobas/264/base 2025-12-04T09:17:11.2247043Z * [new branch] gh/guilhermeleobas/264/head -> origin/gh/guilhermeleobas/264/head 2025-12-04T09:17:11.2248923Z * [new branch] gh/guilhermeleobas/264/orig -> origin/gh/guilhermeleobas/264/orig 2025-12-04T09:17:11.2251545Z * [new branch] gh/guilhermeleobas/265/base -> origin/gh/guilhermeleobas/265/base 2025-12-04T09:17:11.2253386Z * [new branch] gh/guilhermeleobas/265/head -> origin/gh/guilhermeleobas/265/head 2025-12-04T09:17:11.2255280Z * [new branch] gh/guilhermeleobas/265/orig -> origin/gh/guilhermeleobas/265/orig 2025-12-04T09:17:11.2257948Z * [new branch] gh/guilhermeleobas/266/base -> origin/gh/guilhermeleobas/266/base 2025-12-04T09:17:11.2259833Z * [new branch] gh/guilhermeleobas/266/head -> origin/gh/guilhermeleobas/266/head 2025-12-04T09:17:11.2261858Z * [new branch] gh/guilhermeleobas/266/orig -> origin/gh/guilhermeleobas/266/orig 2025-12-04T09:17:11.2264307Z * [new branch] gh/guilhermeleobas/267/base -> origin/gh/guilhermeleobas/267/base 2025-12-04T09:17:11.2266132Z * [new branch] 
gh/guilhermeleobas/267/head -> origin/gh/guilhermeleobas/267/head 2025-12-04T09:17:11.2268101Z * [new branch] gh/guilhermeleobas/267/orig -> origin/gh/guilhermeleobas/267/orig 2025-12-04T09:17:11.2271222Z * [new branch] gh/hameerabbasi/1/base -> origin/gh/hameerabbasi/1/base 2025-12-04T09:17:11.2273082Z * [new branch] gh/hameerabbasi/1/head -> origin/gh/hameerabbasi/1/head 2025-12-04T09:17:11.2276032Z * [new branch] gh/hameerabbasi/2/base -> origin/gh/hameerabbasi/2/base 2025-12-04T09:17:11.2277916Z * [new branch] gh/hameerabbasi/2/head -> origin/gh/hameerabbasi/2/head 2025-12-04T09:17:11.2279860Z * [new branch] gh/hameerabbasi/2/orig -> origin/gh/hameerabbasi/2/orig 2025-12-04T09:17:11.2282370Z * [new branch] gh/hameerabbasi/3/base -> origin/gh/hameerabbasi/3/base 2025-12-04T09:17:11.2284198Z * [new branch] gh/hameerabbasi/3/head -> origin/gh/hameerabbasi/3/head 2025-12-04T09:17:11.2286157Z * [new branch] gh/hameerabbasi/3/orig -> origin/gh/hameerabbasi/3/orig 2025-12-04T09:17:11.2288653Z * [new branch] gh/hameerabbasi/4/base -> origin/gh/hameerabbasi/4/base 2025-12-04T09:17:11.2290542Z * [new branch] gh/hameerabbasi/4/head -> origin/gh/hameerabbasi/4/head 2025-12-04T09:17:11.2292430Z * [new branch] gh/hameerabbasi/4/orig -> origin/gh/hameerabbasi/4/orig 2025-12-04T09:17:11.2295609Z * [new branch] gh/huydhn/1/next -> origin/gh/huydhn/1/next 2025-12-04T09:17:11.2298024Z * [new branch] gh/huydhn/2/next -> origin/gh/huydhn/2/next 2025-12-04T09:17:11.2300637Z * [new branch] gh/huydhn/3/next -> origin/gh/huydhn/3/next 2025-12-04T09:17:11.2303383Z * [new branch] gh/huydhn/4/next -> origin/gh/huydhn/4/next 2025-12-04T09:17:11.2305891Z * [new branch] gh/huydhn/5/next -> origin/gh/huydhn/5/next 2025-12-04T09:17:11.2308344Z * [new branch] gh/huydhn/6/next -> origin/gh/huydhn/6/next 2025-12-04T09:17:11.2311545Z * [new branch] gh/int3/97/base -> origin/gh/int3/97/base 2025-12-04T09:17:11.2313339Z * [new branch] gh/int3/97/head -> origin/gh/int3/97/head 
2025-12-04T09:17:11.2316626Z * [new branch] gh/isuruf/101/base -> origin/gh/isuruf/101/base 2025-12-04T09:17:11.2318399Z * [new branch] gh/isuruf/101/head -> origin/gh/isuruf/101/head 2025-12-04T09:17:11.2321082Z * [new branch] gh/isuruf/146/base -> origin/gh/isuruf/146/base 2025-12-04T09:17:11.2323028Z * [new branch] gh/isuruf/146/head -> origin/gh/isuruf/146/head 2025-12-04T09:17:11.2324854Z * [new branch] gh/isuruf/146/orig -> origin/gh/isuruf/146/orig 2025-12-04T09:17:11.2327368Z * [new branch] gh/isuruf/158/base -> origin/gh/isuruf/158/base 2025-12-04T09:17:11.2329218Z * [new branch] gh/isuruf/158/head -> origin/gh/isuruf/158/head 2025-12-04T09:17:11.2331610Z * [new branch] gh/isuruf/159/base -> origin/gh/isuruf/159/base 2025-12-04T09:17:11.2333436Z * [new branch] gh/isuruf/159/head -> origin/gh/isuruf/159/head 2025-12-04T09:17:11.2335998Z * [new branch] gh/isuruf/160/base -> origin/gh/isuruf/160/base 2025-12-04T09:17:11.2337889Z * [new branch] gh/isuruf/160/head -> origin/gh/isuruf/160/head 2025-12-04T09:17:11.2339744Z * [new branch] gh/isuruf/160/orig -> origin/gh/isuruf/160/orig 2025-12-04T09:17:11.2342306Z * [new branch] gh/isuruf/81/base -> origin/gh/isuruf/81/base 2025-12-04T09:17:11.2344145Z * [new branch] gh/isuruf/81/head -> origin/gh/isuruf/81/head 2025-12-04T09:17:11.2346506Z * [new branch] gh/isuruf/81/orig -> origin/gh/isuruf/81/orig 2025-12-04T09:17:11.2349552Z * [new branch] gh/jamesjwu/176/base -> origin/gh/jamesjwu/176/base 2025-12-04T09:17:11.2351392Z * [new branch] gh/jamesjwu/176/head -> origin/gh/jamesjwu/176/head 2025-12-04T09:17:11.2353237Z * [new branch] gh/jamesjwu/176/orig -> origin/gh/jamesjwu/176/orig 2025-12-04T09:17:11.2355716Z * [new branch] gh/jamesjwu/187/base -> origin/gh/jamesjwu/187/base 2025-12-04T09:17:11.2357769Z * [new branch] gh/jamesjwu/187/head -> origin/gh/jamesjwu/187/head 2025-12-04T09:17:11.2359570Z * [new branch] gh/jamesjwu/187/orig -> origin/gh/jamesjwu/187/orig 2025-12-04T09:17:11.2362198Z * [new branch] 
gh/jamesjwu/196/base -> origin/gh/jamesjwu/196/base 2025-12-04T09:17:11.2363983Z * [new branch] gh/jamesjwu/196/head -> origin/gh/jamesjwu/196/head 2025-12-04T09:17:11.2365888Z * [new branch] gh/jamesjwu/196/orig -> origin/gh/jamesjwu/196/orig 2025-12-04T09:17:11.2368442Z * [new branch] gh/jamesjwu/198/base -> origin/gh/jamesjwu/198/base 2025-12-04T09:17:11.2370221Z * [new branch] gh/jamesjwu/198/head -> origin/gh/jamesjwu/198/head 2025-12-04T09:17:11.2372564Z * [new branch] gh/jamesjwu/198/orig -> origin/gh/jamesjwu/198/orig 2025-12-04T09:17:11.2375109Z * [new branch] gh/jamesjwu/207/base -> origin/gh/jamesjwu/207/base 2025-12-04T09:17:11.2377085Z * [new branch] gh/jamesjwu/207/head -> origin/gh/jamesjwu/207/head 2025-12-04T09:17:11.2378909Z * [new branch] gh/jamesjwu/207/orig -> origin/gh/jamesjwu/207/orig 2025-12-04T09:17:11.2381600Z * [new branch] gh/jamesjwu/208/base -> origin/gh/jamesjwu/208/base 2025-12-04T09:17:11.2383775Z * [new branch] gh/jamesjwu/208/head -> origin/gh/jamesjwu/208/head 2025-12-04T09:17:11.2385723Z * [new branch] gh/jamesjwu/208/orig -> origin/gh/jamesjwu/208/orig 2025-12-04T09:17:11.2388352Z * [new branch] gh/jamesjwu/52/base -> origin/gh/jamesjwu/52/base 2025-12-04T09:17:11.2390156Z * [new branch] gh/jamesjwu/52/head -> origin/gh/jamesjwu/52/head 2025-12-04T09:17:11.2392691Z * [new branch] gh/jamesjwu/53/base -> origin/gh/jamesjwu/53/base 2025-12-04T09:17:11.2394359Z * [new branch] gh/jamesjwu/53/head -> origin/gh/jamesjwu/53/head 2025-12-04T09:17:11.2396704Z * [new branch] gh/jamesjwu/54/base -> origin/gh/jamesjwu/54/base 2025-12-04T09:17:11.2398464Z * [new branch] gh/jamesjwu/54/head -> origin/gh/jamesjwu/54/head 2025-12-04T09:17:11.2401273Z * [new branch] gh/jamesjwu/55/base -> origin/gh/jamesjwu/55/base 2025-12-04T09:17:11.2406017Z * [new branch] gh/jamesjwu/55/head -> origin/gh/jamesjwu/55/head 2025-12-04T09:17:11.2408371Z * [new branch] gh/jamesjwu/56/base -> origin/gh/jamesjwu/56/base 2025-12-04T09:17:11.2410135Z * [new branch] 
gh/jamesjwu/56/head -> origin/gh/jamesjwu/56/head 2025-12-04T09:17:11.2412596Z * [new branch] gh/jamesjwu/57/base -> origin/gh/jamesjwu/57/base 2025-12-04T09:17:11.2414391Z * [new branch] gh/jamesjwu/57/head -> origin/gh/jamesjwu/57/head 2025-12-04T09:17:11.2416744Z * [new branch] gh/jamesjwu/58/base -> origin/gh/jamesjwu/58/base 2025-12-04T09:17:11.2418707Z * [new branch] gh/jamesjwu/58/head -> origin/gh/jamesjwu/58/head 2025-12-04T09:17:11.2421048Z * [new branch] gh/jamesjwu/59/base -> origin/gh/jamesjwu/59/base 2025-12-04T09:17:11.2422844Z * [new branch] gh/jamesjwu/59/head -> origin/gh/jamesjwu/59/head 2025-12-04T09:17:11.2425245Z * [new branch] gh/jamesjwu/60/base -> origin/gh/jamesjwu/60/base 2025-12-04T09:17:11.2427036Z * [new branch] gh/jamesjwu/60/head -> origin/gh/jamesjwu/60/head 2025-12-04T09:17:11.2429457Z * [new branch] gh/jamesjwu/61/base -> origin/gh/jamesjwu/61/base 2025-12-04T09:17:11.2431227Z * [new branch] gh/jamesjwu/61/head -> origin/gh/jamesjwu/61/head 2025-12-04T09:17:11.2433716Z * [new branch] gh/jamesjwu/62/base -> origin/gh/jamesjwu/62/base 2025-12-04T09:17:11.2435529Z * [new branch] gh/jamesjwu/62/head -> origin/gh/jamesjwu/62/head 2025-12-04T09:17:11.2437903Z * [new branch] gh/jamesjwu/63/base -> origin/gh/jamesjwu/63/base 2025-12-04T09:17:11.2439780Z * [new branch] gh/jamesjwu/63/head -> origin/gh/jamesjwu/63/head 2025-12-04T09:17:11.2443184Z * [new branch] gh/jamesjwu/64/base -> origin/gh/jamesjwu/64/base 2025-12-04T09:17:11.2444877Z * [new branch] gh/jamesjwu/64/head -> origin/gh/jamesjwu/64/head 2025-12-04T09:17:11.2447302Z * [new branch] gh/jamesjwu/65/base -> origin/gh/jamesjwu/65/base 2025-12-04T09:17:11.2449095Z * [new branch] gh/jamesjwu/65/head -> origin/gh/jamesjwu/65/head 2025-12-04T09:17:11.2452255Z * [new branch] gh/janeyx99/165/base -> origin/gh/janeyx99/165/base 2025-12-04T09:17:11.2454190Z * [new branch] gh/janeyx99/165/head -> origin/gh/janeyx99/165/head 2025-12-04T09:17:11.2455997Z * [new branch] gh/janeyx99/165/orig 
-> origin/gh/janeyx99/165/orig 2025-12-04T09:17:11.2458676Z * [new branch] gh/janeyx99/201/base -> origin/gh/janeyx99/201/base 2025-12-04T09:17:11.2460351Z * [new branch] gh/janeyx99/201/head -> origin/gh/janeyx99/201/head 2025-12-04T09:17:11.2462182Z * [new branch] gh/janeyx99/201/orig -> origin/gh/janeyx99/201/orig 2025-12-04T09:17:11.2464937Z * [new branch] gh/janeyx99/225/base -> origin/gh/janeyx99/225/base 2025-12-04T09:17:11.2466799Z * [new branch] gh/janeyx99/225/head -> origin/gh/janeyx99/225/head 2025-12-04T09:17:11.2468658Z * [new branch] gh/janeyx99/225/orig -> origin/gh/janeyx99/225/orig 2025-12-04T09:17:11.2471181Z * [new branch] gh/janeyx99/299/base -> origin/gh/janeyx99/299/base 2025-12-04T09:17:11.2473165Z * [new branch] gh/janeyx99/299/head -> origin/gh/janeyx99/299/head 2025-12-04T09:17:11.2474863Z * [new branch] gh/janeyx99/299/orig -> origin/gh/janeyx99/299/orig 2025-12-04T09:17:11.2477725Z * [new branch] gh/janeyx99/302/base -> origin/gh/janeyx99/302/base 2025-12-04T09:17:11.2479656Z * [new branch] gh/janeyx99/302/head -> origin/gh/janeyx99/302/head 2025-12-04T09:17:11.2482207Z * [new branch] gh/janeyx99/303/base -> origin/gh/janeyx99/303/base 2025-12-04T09:17:11.2483970Z * [new branch] gh/janeyx99/303/head -> origin/gh/janeyx99/303/head 2025-12-04T09:17:11.2486535Z * [new branch] gh/janeyx99/305/base -> origin/gh/janeyx99/305/base 2025-12-04T09:17:11.2488472Z * [new branch] gh/janeyx99/305/head -> origin/gh/janeyx99/305/head 2025-12-04T09:17:11.2490797Z * [new branch] gh/janeyx99/306/base -> origin/gh/janeyx99/306/base 2025-12-04T09:17:11.2492525Z * [new branch] gh/janeyx99/306/head -> origin/gh/janeyx99/306/head 2025-12-04T09:17:11.2495044Z * [new branch] gh/janeyx99/314/base -> origin/gh/janeyx99/314/base 2025-12-04T09:17:11.2496889Z * [new branch] gh/janeyx99/314/head -> origin/gh/janeyx99/314/head 2025-12-04T09:17:11.2498751Z * [new branch] gh/janeyx99/314/orig -> origin/gh/janeyx99/314/orig 2025-12-04T09:17:11.2501291Z * [new branch] 
gh/janeyx99/315/base -> origin/gh/janeyx99/315/base 2025-12-04T09:17:11.2503550Z * [new branch] gh/janeyx99/315/head -> origin/gh/janeyx99/315/head 2025-12-04T09:17:11.2505450Z * [new branch] gh/janeyx99/315/orig -> origin/gh/janeyx99/315/orig 2025-12-04T09:17:11.2508071Z * [new branch] gh/janeyx99/316/base -> origin/gh/janeyx99/316/base 2025-12-04T09:17:11.2510011Z * [new branch] gh/janeyx99/316/head -> origin/gh/janeyx99/316/head 2025-12-04T09:17:11.2511803Z * [new branch] gh/janeyx99/316/orig -> origin/gh/janeyx99/316/orig 2025-12-04T09:17:11.2514459Z * [new branch] gh/janeyx99/317/base -> origin/gh/janeyx99/317/base 2025-12-04T09:17:11.2516257Z * [new branch] gh/janeyx99/317/head -> origin/gh/janeyx99/317/head 2025-12-04T09:17:11.2518233Z * [new branch] gh/janeyx99/317/orig -> origin/gh/janeyx99/317/orig 2025-12-04T09:17:11.2521038Z * [new branch] gh/janeyx99/325/base -> origin/gh/janeyx99/325/base 2025-12-04T09:17:11.2522907Z * [new branch] gh/janeyx99/325/head -> origin/gh/janeyx99/325/head 2025-12-04T09:17:11.2524663Z * [new branch] gh/janeyx99/325/orig -> origin/gh/janeyx99/325/orig 2025-12-04T09:17:11.2527161Z * [new branch] gh/janeyx99/327/base -> origin/gh/janeyx99/327/base 2025-12-04T09:17:11.2529133Z * [new branch] gh/janeyx99/327/head -> origin/gh/janeyx99/327/head 2025-12-04T09:17:11.2530873Z * [new branch] gh/janeyx99/327/orig -> origin/gh/janeyx99/327/orig 2025-12-04T09:17:11.2534036Z * [new branch] gh/janeyx99/328/base -> origin/gh/janeyx99/328/base 2025-12-04T09:17:11.2536008Z * [new branch] gh/janeyx99/328/head -> origin/gh/janeyx99/328/head 2025-12-04T09:17:11.2537927Z * [new branch] gh/janeyx99/328/orig -> origin/gh/janeyx99/328/orig 2025-12-04T09:17:11.2540304Z * [new branch] gh/janeyx99/329/base -> origin/gh/janeyx99/329/base 2025-12-04T09:17:11.2542328Z * [new branch] gh/janeyx99/329/head -> origin/gh/janeyx99/329/head 2025-12-04T09:17:11.2544155Z * [new branch] gh/janeyx99/329/orig -> origin/gh/janeyx99/329/orig 
2025-12-04T09:17:11.2547051Z * [new branch] gh/janeyx99/330/base -> origin/gh/janeyx99/330/base 2025-12-04T09:17:11.2549433Z * [new branch] gh/janeyx99/330/head -> origin/gh/janeyx99/330/head 2025-12-04T09:17:11.2550992Z * [new branch] gh/janeyx99/330/orig -> origin/gh/janeyx99/330/orig 2025-12-04T09:17:11.2553403Z * [new branch] gh/janeyx99/331/base -> origin/gh/janeyx99/331/base 2025-12-04T09:17:11.2555333Z * [new branch] gh/janeyx99/331/head -> origin/gh/janeyx99/331/head 2025-12-04T09:17:11.2557037Z * [new branch] gh/janeyx99/331/orig -> origin/gh/janeyx99/331/orig 2025-12-04T09:17:11.2559788Z * [new branch] gh/janeyx99/332/base -> origin/gh/janeyx99/332/base 2025-12-04T09:17:11.2561989Z * [new branch] gh/janeyx99/332/head -> origin/gh/janeyx99/332/head 2025-12-04T09:17:11.2563578Z * [new branch] gh/janeyx99/332/orig -> origin/gh/janeyx99/332/orig 2025-12-04T09:17:11.2565986Z * [new branch] gh/janeyx99/333/base -> origin/gh/janeyx99/333/base 2025-12-04T09:17:11.2567805Z * [new branch] gh/janeyx99/333/head -> origin/gh/janeyx99/333/head 2025-12-04T09:17:11.2569772Z * [new branch] gh/janeyx99/333/orig -> origin/gh/janeyx99/333/orig 2025-12-04T09:17:11.2572484Z * [new branch] gh/janeyx99/88/base -> origin/gh/janeyx99/88/base 2025-12-04T09:17:11.2574364Z * [new branch] gh/janeyx99/88/head -> origin/gh/janeyx99/88/head 2025-12-04T09:17:11.2576197Z * [new branch] gh/janeyx99/88/orig -> origin/gh/janeyx99/88/orig 2025-12-04T09:17:11.2579302Z * [new branch] gh/jansel/360/base -> origin/gh/jansel/360/base 2025-12-04T09:17:11.2581167Z * [new branch] gh/jansel/360/head -> origin/gh/jansel/360/head 2025-12-04T09:17:11.2583738Z * [new branch] gh/jansel/451/base -> origin/gh/jansel/451/base 2025-12-04T09:17:11.2585628Z * [new branch] gh/jansel/451/head -> origin/gh/jansel/451/head 2025-12-04T09:17:11.2587458Z * [new branch] gh/jansel/451/orig -> origin/gh/jansel/451/orig 2025-12-04T09:17:11.2590000Z * [new branch] gh/jansel/462/base -> origin/gh/jansel/462/base 
2025-12-04T09:17:11.2591813Z * [new branch] gh/jansel/462/head -> origin/gh/jansel/462/head 2025-12-04T09:17:11.2593608Z * [new branch] gh/jansel/462/orig -> origin/gh/jansel/462/orig 2025-12-04T09:17:11.2596070Z * [new branch] gh/jansel/533/base -> origin/gh/jansel/533/base 2025-12-04T09:17:11.2597893Z * [new branch] gh/jansel/533/head -> origin/gh/jansel/533/head 2025-12-04T09:17:11.2599788Z * [new branch] gh/jansel/533/orig -> origin/gh/jansel/533/orig 2025-12-04T09:17:11.2602754Z * [new branch] gh/jansel/552/base -> origin/gh/jansel/552/base 2025-12-04T09:17:11.2604558Z * [new branch] gh/jansel/552/head -> origin/gh/jansel/552/head 2025-12-04T09:17:11.2606373Z * [new branch] gh/jansel/552/orig -> origin/gh/jansel/552/orig 2025-12-04T09:17:11.2609028Z * [new branch] gh/jansel/553/base -> origin/gh/jansel/553/base 2025-12-04T09:17:11.2610824Z * [new branch] gh/jansel/553/head -> origin/gh/jansel/553/head 2025-12-04T09:17:11.2612641Z * [new branch] gh/jansel/553/orig -> origin/gh/jansel/553/orig 2025-12-04T09:17:11.2615181Z * [new branch] gh/jansel/554/base -> origin/gh/jansel/554/base 2025-12-04T09:17:11.2617540Z * [new branch] gh/jansel/554/head -> origin/gh/jansel/554/head 2025-12-04T09:17:11.2619395Z * [new branch] gh/jansel/554/orig -> origin/gh/jansel/554/orig 2025-12-04T09:17:11.2621881Z * [new branch] gh/jansel/555/base -> origin/gh/jansel/555/base 2025-12-04T09:17:11.2623961Z * [new branch] gh/jansel/555/head -> origin/gh/jansel/555/head 2025-12-04T09:17:11.2625754Z * [new branch] gh/jansel/555/orig -> origin/gh/jansel/555/orig 2025-12-04T09:17:11.2628199Z * [new branch] gh/jansel/556/base -> origin/gh/jansel/556/base 2025-12-04T09:17:11.2630543Z * [new branch] gh/jansel/556/head -> origin/gh/jansel/556/head 2025-12-04T09:17:11.2631935Z * [new branch] gh/jansel/556/orig -> origin/gh/jansel/556/orig 2025-12-04T09:17:11.2634736Z * [new branch] gh/jansel/557/base -> origin/gh/jansel/557/base 2025-12-04T09:17:11.2636402Z * [new branch] gh/jansel/557/head -> 
origin/gh/jansel/557/head 2025-12-04T09:17:11.2638285Z * [new branch] gh/jansel/557/orig -> origin/gh/jansel/557/orig 2025-12-04T09:17:11.2641094Z * [new branch] gh/jansel/558/base -> origin/gh/jansel/558/base 2025-12-04T09:17:11.2642804Z * [new branch] gh/jansel/558/head -> origin/gh/jansel/558/head 2025-12-04T09:17:11.2644511Z * [new branch] gh/jansel/558/orig -> origin/gh/jansel/558/orig 2025-12-04T09:17:11.2647023Z * [new branch] gh/jansel/559/base -> origin/gh/jansel/559/base 2025-12-04T09:17:11.2649066Z * [new branch] gh/jansel/559/head -> origin/gh/jansel/559/head 2025-12-04T09:17:11.2650760Z * [new branch] gh/jansel/559/orig -> origin/gh/jansel/559/orig 2025-12-04T09:17:11.2653552Z * [new branch] gh/jansel/560/base -> origin/gh/jansel/560/base 2025-12-04T09:17:11.2655065Z * [new branch] gh/jansel/560/head -> origin/gh/jansel/560/head 2025-12-04T09:17:11.2656860Z * [new branch] gh/jansel/560/orig -> origin/gh/jansel/560/orig 2025-12-04T09:17:11.2659510Z * [new branch] gh/jansel/561/base -> origin/gh/jansel/561/base 2025-12-04T09:17:11.2661310Z * [new branch] gh/jansel/561/head -> origin/gh/jansel/561/head 2025-12-04T09:17:11.2663164Z * [new branch] gh/jansel/561/orig -> origin/gh/jansel/561/orig 2025-12-04T09:17:11.2665701Z * [new branch] gh/jansel/562/base -> origin/gh/jansel/562/base 2025-12-04T09:17:11.2667556Z * [new branch] gh/jansel/562/head -> origin/gh/jansel/562/head 2025-12-04T09:17:11.2669354Z * [new branch] gh/jansel/562/orig -> origin/gh/jansel/562/orig 2025-12-04T09:17:11.2671843Z * [new branch] gh/jansel/563/base -> origin/gh/jansel/563/base 2025-12-04T09:17:11.2674201Z * [new branch] gh/jansel/563/head -> origin/gh/jansel/563/head 2025-12-04T09:17:11.2676065Z * [new branch] gh/jansel/563/orig -> origin/gh/jansel/563/orig 2025-12-04T09:17:11.2679174Z * [new branch] gh/jansel/564/base -> origin/gh/jansel/564/base 2025-12-04T09:17:11.2681193Z * [new branch] gh/jansel/564/head -> origin/gh/jansel/564/head 2025-12-04T09:17:11.2683157Z * [new 
branch] gh/jansel/564/orig -> origin/gh/jansel/564/orig 2025-12-04T09:17:11.2685732Z * [new branch] gh/jansel/565/base -> origin/gh/jansel/565/base 2025-12-04T09:17:11.2687546Z * [new branch] gh/jansel/565/head -> origin/gh/jansel/565/head 2025-12-04T09:17:11.2689378Z * [new branch] gh/jansel/565/orig -> origin/gh/jansel/565/orig 2025-12-04T09:17:11.2691954Z * [new branch] gh/jansel/566/base -> origin/gh/jansel/566/base 2025-12-04T09:17:11.2693771Z * [new branch] gh/jansel/566/head -> origin/gh/jansel/566/head 2025-12-04T09:17:11.2695607Z * [new branch] gh/jansel/566/orig -> origin/gh/jansel/566/orig 2025-12-04T09:17:11.2698752Z * [new branch] gh/jansel/567/base -> origin/gh/jansel/567/base 2025-12-04T09:17:11.2700966Z * [new branch] gh/jansel/567/head -> origin/gh/jansel/567/head 2025-12-04T09:17:11.2702777Z * [new branch] gh/jansel/567/orig -> origin/gh/jansel/567/orig 2025-12-04T09:17:11.2705390Z * [new branch] gh/jansel/568/base -> origin/gh/jansel/568/base 2025-12-04T09:17:11.2707470Z * [new branch] gh/jansel/568/head -> origin/gh/jansel/568/head 2025-12-04T09:17:11.2709283Z * [new branch] gh/jansel/568/orig -> origin/gh/jansel/568/orig 2025-12-04T09:17:11.2711926Z * [new branch] gh/jansel/569/base -> origin/gh/jansel/569/base 2025-12-04T09:17:11.2713783Z * [new branch] gh/jansel/569/head -> origin/gh/jansel/569/head 2025-12-04T09:17:11.2715881Z * [new branch] gh/jansel/569/orig -> origin/gh/jansel/569/orig 2025-12-04T09:17:11.2718520Z * [new branch] gh/jansel/570/base -> origin/gh/jansel/570/base 2025-12-04T09:17:11.2720430Z * [new branch] gh/jansel/570/head -> origin/gh/jansel/570/head 2025-12-04T09:17:11.2722192Z * [new branch] gh/jansel/570/orig -> origin/gh/jansel/570/orig 2025-12-04T09:17:11.2724763Z * [new branch] gh/jansel/571/base -> origin/gh/jansel/571/base 2025-12-04T09:17:11.2726984Z * [new branch] gh/jansel/571/head -> origin/gh/jansel/571/head 2025-12-04T09:17:11.2728462Z * [new branch] gh/jansel/571/orig -> origin/gh/jansel/571/orig 
2025-12-04T09:17:11.2731000Z * [new branch] gh/jansel/572/base -> origin/gh/jansel/572/base 2025-12-04T09:17:11.2732817Z * [new branch] gh/jansel/572/head -> origin/gh/jansel/572/head 2025-12-04T09:17:11.2734620Z * [new branch] gh/jansel/572/orig -> origin/gh/jansel/572/orig 2025-12-04T09:17:11.2737332Z * [new branch] gh/jansel/573/base -> origin/gh/jansel/573/base 2025-12-04T09:17:11.2739178Z * [new branch] gh/jansel/573/head -> origin/gh/jansel/573/head 2025-12-04T09:17:11.2741019Z * [new branch] gh/jansel/573/orig -> origin/gh/jansel/573/orig 2025-12-04T09:17:11.2743646Z * [new branch] gh/jansel/574/base -> origin/gh/jansel/574/base 2025-12-04T09:17:11.2745477Z * [new branch] gh/jansel/574/head -> origin/gh/jansel/574/head 2025-12-04T09:17:11.2747428Z * [new branch] gh/jansel/574/orig -> origin/gh/jansel/574/orig 2025-12-04T09:17:11.2750076Z * [new branch] gh/jansel/575/base -> origin/gh/jansel/575/base 2025-12-04T09:17:11.2751884Z * [new branch] gh/jansel/575/head -> origin/gh/jansel/575/head 2025-12-04T09:17:11.2753741Z * [new branch] gh/jansel/575/orig -> origin/gh/jansel/575/orig 2025-12-04T09:17:11.2756393Z * [new branch] gh/jansel/576/base -> origin/gh/jansel/576/base 2025-12-04T09:17:11.2758195Z * [new branch] gh/jansel/576/head -> origin/gh/jansel/576/head 2025-12-04T09:17:11.2760109Z * [new branch] gh/jansel/576/orig -> origin/gh/jansel/576/orig 2025-12-04T09:17:11.2763448Z * [new branch] gh/jbschlosser/247/base -> origin/gh/jbschlosser/247/base 2025-12-04T09:17:11.2765308Z * [new branch] gh/jbschlosser/247/head -> origin/gh/jbschlosser/247/head 2025-12-04T09:17:11.2767083Z * [new branch] gh/jbschlosser/247/orig -> origin/gh/jbschlosser/247/orig 2025-12-04T09:17:11.2769709Z * [new branch] gh/jbschlosser/250/base -> origin/gh/jbschlosser/250/base 2025-12-04T09:17:11.2771495Z * [new branch] gh/jbschlosser/250/head -> origin/gh/jbschlosser/250/head 2025-12-04T09:17:11.2773363Z * [new branch] gh/jbschlosser/250/orig -> origin/gh/jbschlosser/250/orig 
2025-12-04T09:17:11.2776649Z * [new branch] gh/jerryzh168/1/base -> origin/gh/jerryzh168/1/base 2025-12-04T09:17:11.2778484Z * [new branch] gh/jerryzh168/1/head -> origin/gh/jerryzh168/1/head 2025-12-04T09:17:11.2780192Z * [new branch] gh/jerryzh168/1/orig -> origin/gh/jerryzh168/1/orig 2025-12-04T09:17:11.2783436Z * [new branch] gh/jiayisunx/59/base -> origin/gh/jiayisunx/59/base 2025-12-04T09:17:11.2785240Z * [new branch] gh/jiayisunx/59/head -> origin/gh/jiayisunx/59/head 2025-12-04T09:17:11.2787256Z * [new branch] gh/jiayisunx/59/orig -> origin/gh/jiayisunx/59/orig 2025-12-04T09:17:11.2789557Z * [new branch] gh/jiayisunx/61/base -> origin/gh/jiayisunx/61/base 2025-12-04T09:17:11.2791396Z * [new branch] gh/jiayisunx/61/head -> origin/gh/jiayisunx/61/head 2025-12-04T09:17:11.2793223Z * [new branch] gh/jiayisunx/61/orig -> origin/gh/jiayisunx/61/orig 2025-12-04T09:17:11.2795714Z * [new branch] gh/jiayisunx/68/base -> origin/gh/jiayisunx/68/base 2025-12-04T09:17:11.2798057Z * [new branch] gh/jiayisunx/68/head -> origin/gh/jiayisunx/68/head 2025-12-04T09:17:11.2800050Z * [new branch] gh/jiayisunx/68/orig -> origin/gh/jiayisunx/68/orig 2025-12-04T09:17:11.2803041Z * [new branch] gh/jiayisunx/77/base -> origin/gh/jiayisunx/77/base 2025-12-04T09:17:11.2804766Z * [new branch] gh/jiayisunx/77/head -> origin/gh/jiayisunx/77/head 2025-12-04T09:17:11.2806596Z * [new branch] gh/jiayisunx/77/orig -> origin/gh/jiayisunx/77/orig 2025-12-04T09:17:11.2809254Z * [new branch] gh/jiayisunx/78/base -> origin/gh/jiayisunx/78/base 2025-12-04T09:17:11.2811089Z * [new branch] gh/jiayisunx/78/head -> origin/gh/jiayisunx/78/head 2025-12-04T09:17:11.2812898Z * [new branch] gh/jiayisunx/78/orig -> origin/gh/jiayisunx/78/orig 2025-12-04T09:17:11.2815369Z * [new branch] gh/jiayisunx/79/base -> origin/gh/jiayisunx/79/base 2025-12-04T09:17:11.2817196Z * [new branch] gh/jiayisunx/79/head -> origin/gh/jiayisunx/79/head 2025-12-04T09:17:11.2819001Z * [new branch] gh/jiayisunx/79/orig -> 
origin/gh/jiayisunx/79/orig 2025-12-04T09:17:11.2821562Z * [new branch] gh/jiayisunx/82/base -> origin/gh/jiayisunx/82/base 2025-12-04T09:17:11.2823378Z * [new branch] gh/jiayisunx/82/head -> origin/gh/jiayisunx/82/head 2025-12-04T09:17:11.2825745Z * [new branch] gh/jiayisunx/82/orig -> origin/gh/jiayisunx/82/orig 2025-12-04T09:17:11.2828322Z * [new branch] gh/jiayisunx/83/base -> origin/gh/jiayisunx/83/base 2025-12-04T09:17:11.2830168Z * [new branch] gh/jiayisunx/83/head -> origin/gh/jiayisunx/83/head 2025-12-04T09:17:11.2832011Z * [new branch] gh/jiayisunx/83/orig -> origin/gh/jiayisunx/83/orig 2025-12-04T09:17:11.2834527Z * [new branch] gh/jiayisunx/84/base -> origin/gh/jiayisunx/84/base 2025-12-04T09:17:11.2836424Z * [new branch] gh/jiayisunx/84/head -> origin/gh/jiayisunx/84/head 2025-12-04T09:17:11.2838263Z * [new branch] gh/jiayisunx/84/orig -> origin/gh/jiayisunx/84/orig 2025-12-04T09:17:11.2840948Z * [new branch] gh/jiayisunx/85/base -> origin/gh/jiayisunx/85/base 2025-12-04T09:17:11.2842743Z * [new branch] gh/jiayisunx/85/head -> origin/gh/jiayisunx/85/head 2025-12-04T09:17:11.2844862Z * [new branch] gh/jiayisunx/85/orig -> origin/gh/jiayisunx/85/orig 2025-12-04T09:17:11.2847543Z * [new branch] gh/jiayisunx/86/base -> origin/gh/jiayisunx/86/base 2025-12-04T09:17:11.2849384Z * [new branch] gh/jiayisunx/86/head -> origin/gh/jiayisunx/86/head 2025-12-04T09:17:11.2851712Z * [new branch] gh/jiayisunx/86/orig -> origin/gh/jiayisunx/86/orig 2025-12-04T09:17:11.2853827Z * [new branch] gh/jiayisunx/87/base -> origin/gh/jiayisunx/87/base 2025-12-04T09:17:11.2855672Z * [new branch] gh/jiayisunx/87/head -> origin/gh/jiayisunx/87/head 2025-12-04T09:17:11.2857563Z * [new branch] gh/jiayisunx/87/orig -> origin/gh/jiayisunx/87/orig 2025-12-04T09:17:11.2860217Z * [new branch] gh/jiayisunx/88/base -> origin/gh/jiayisunx/88/base 2025-12-04T09:17:11.2862037Z * [new branch] gh/jiayisunx/88/head -> origin/gh/jiayisunx/88/head 2025-12-04T09:17:11.2863847Z * [new branch] 
gh/jiayisunx/88/orig -> origin/gh/jiayisunx/88/orig 2025-12-04T09:17:11.2866420Z * [new branch] gh/jiayisunx/89/base -> origin/gh/jiayisunx/89/base 2025-12-04T09:17:11.2868204Z * [new branch] gh/jiayisunx/89/head -> origin/gh/jiayisunx/89/head 2025-12-04T09:17:11.2870047Z * [new branch] gh/jiayisunx/89/orig -> origin/gh/jiayisunx/89/orig 2025-12-04T09:17:11.2872538Z * [new branch] gh/jiayisunx/90/base -> origin/gh/jiayisunx/90/base 2025-12-04T09:17:11.2874335Z * [new branch] gh/jiayisunx/90/head -> origin/gh/jiayisunx/90/head 2025-12-04T09:17:11.2876175Z * [new branch] gh/jiayisunx/90/orig -> origin/gh/jiayisunx/90/orig 2025-12-04T09:17:11.2879104Z * [new branch] gh/jjwu@meta.com/1/base -> origin/gh/jjwu@meta.com/1/base 2025-12-04T09:17:11.2881035Z * [new branch] gh/jjwu@meta.com/1/head -> origin/gh/jjwu@meta.com/1/head 2025-12-04T09:17:11.2884133Z * [new branch] gh/jturney/1/base -> origin/gh/jturney/1/base 2025-12-04T09:17:11.2885929Z * [new branch] gh/jturney/1/head -> origin/gh/jturney/1/head 2025-12-04T09:17:11.2887769Z * [new branch] gh/jturney/1/orig -> origin/gh/jturney/1/orig 2025-12-04T09:17:11.2890296Z * [new branch] gh/jturney/2/base -> origin/gh/jturney/2/base 2025-12-04T09:17:11.2892145Z * [new branch] gh/jturney/2/head -> origin/gh/jturney/2/head 2025-12-04T09:17:11.2893929Z * [new branch] gh/jturney/2/orig -> origin/gh/jturney/2/orig 2025-12-04T09:17:11.2897154Z * [new branch] gh/karthickai/10/base -> origin/gh/karthickai/10/base 2025-12-04T09:17:11.2899096Z * [new branch] gh/karthickai/10/head -> origin/gh/karthickai/10/head 2025-12-04T09:17:11.2901363Z * [new branch] gh/karthickai/10/orig -> origin/gh/karthickai/10/orig 2025-12-04T09:17:11.2906567Z * [new branch] gh/karthickai/11/base -> origin/gh/karthickai/11/base 2025-12-04T09:17:11.2908575Z * [new branch] gh/karthickai/11/head -> origin/gh/karthickai/11/head 2025-12-04T09:17:11.2910549Z * [new branch] gh/karthickai/11/orig -> origin/gh/karthickai/11/orig 2025-12-04T09:17:11.2913421Z * [new 
branch] gh/karthickai/12/base -> origin/gh/karthickai/12/base 2025-12-04T09:17:11.2915275Z * [new branch] gh/karthickai/12/head -> origin/gh/karthickai/12/head 2025-12-04T09:17:11.2917121Z * [new branch] gh/karthickai/12/orig -> origin/gh/karthickai/12/orig 2025-12-04T09:17:11.2919718Z * [new branch] gh/karthickai/13/base -> origin/gh/karthickai/13/base 2025-12-04T09:17:11.2921842Z * [new branch] gh/karthickai/13/head -> origin/gh/karthickai/13/head 2025-12-04T09:17:11.2923676Z * [new branch] gh/karthickai/13/orig -> origin/gh/karthickai/13/orig 2025-12-04T09:17:11.2926220Z * [new branch] gh/karthickai/14/base -> origin/gh/karthickai/14/base 2025-12-04T09:17:11.2928182Z * [new branch] gh/karthickai/14/head -> origin/gh/karthickai/14/head 2025-12-04T09:17:11.2930489Z * [new branch] gh/karthickai/14/orig -> origin/gh/karthickai/14/orig 2025-12-04T09:17:11.2933223Z * [new branch] gh/karthickai/15/base -> origin/gh/karthickai/15/base 2025-12-04T09:17:11.2935282Z * [new branch] gh/karthickai/15/head -> origin/gh/karthickai/15/head 2025-12-04T09:17:11.2937498Z * [new branch] gh/karthickai/15/orig -> origin/gh/karthickai/15/orig 2025-12-04T09:17:11.2940153Z * [new branch] gh/karthickai/16/base -> origin/gh/karthickai/16/base 2025-12-04T09:17:11.2942202Z * [new branch] gh/karthickai/16/head -> origin/gh/karthickai/16/head 2025-12-04T09:17:11.2944254Z * [new branch] gh/karthickai/16/orig -> origin/gh/karthickai/16/orig 2025-12-04T09:17:11.2946905Z * [new branch] gh/karthickai/17/base -> origin/gh/karthickai/17/base 2025-12-04T09:17:11.2948926Z * [new branch] gh/karthickai/17/head -> origin/gh/karthickai/17/head 2025-12-04T09:17:11.2950960Z * [new branch] gh/karthickai/17/orig -> origin/gh/karthickai/17/orig 2025-12-04T09:17:11.2953775Z * [new branch] gh/karthickai/18/base -> origin/gh/karthickai/18/base 2025-12-04T09:17:11.2956056Z * [new branch] gh/karthickai/18/head -> origin/gh/karthickai/18/head 2025-12-04T09:17:11.2958278Z * [new branch] gh/karthickai/18/orig -> 
origin/gh/karthickai/18/orig 2025-12-04T09:17:11.2961408Z * [new branch] gh/karthickai/19/base -> origin/gh/karthickai/19/base 2025-12-04T09:17:11.2963657Z * [new branch] gh/karthickai/19/head -> origin/gh/karthickai/19/head 2025-12-04T09:17:11.2966128Z * [new branch] gh/karthickai/19/orig -> origin/gh/karthickai/19/orig 2025-12-04T09:17:11.2969667Z * [new branch] gh/karthickai/20/base -> origin/gh/karthickai/20/base 2025-12-04T09:17:11.2972433Z * [new branch] gh/karthickai/20/head -> origin/gh/karthickai/20/head 2025-12-04T09:17:11.2974567Z * [new branch] gh/karthickai/20/orig -> origin/gh/karthickai/20/orig 2025-12-04T09:17:11.2977327Z * [new branch] gh/karthickai/21/base -> origin/gh/karthickai/21/base 2025-12-04T09:17:11.2979617Z * [new branch] gh/karthickai/21/head -> origin/gh/karthickai/21/head 2025-12-04T09:17:11.2982219Z * [new branch] gh/karthickai/21/orig -> origin/gh/karthickai/21/orig 2025-12-04T09:17:11.2985371Z * [new branch] gh/karthickai/22/base -> origin/gh/karthickai/22/base 2025-12-04T09:17:11.2987238Z * [new branch] gh/karthickai/22/head -> origin/gh/karthickai/22/head 2025-12-04T09:17:11.2989279Z * [new branch] gh/karthickai/22/orig -> origin/gh/karthickai/22/orig 2025-12-04T09:17:11.2992136Z * [new branch] gh/karthickai/23/base -> origin/gh/karthickai/23/base 2025-12-04T09:17:11.2994342Z * [new branch] gh/karthickai/23/head -> origin/gh/karthickai/23/head 2025-12-04T09:17:11.2996505Z * [new branch] gh/karthickai/23/orig -> origin/gh/karthickai/23/orig 2025-12-04T09:17:11.2999316Z * [new branch] gh/karthickai/24/base -> origin/gh/karthickai/24/base 2025-12-04T09:17:11.3001902Z * [new branch] gh/karthickai/24/head -> origin/gh/karthickai/24/head 2025-12-04T09:17:11.3003961Z * [new branch] gh/karthickai/24/orig -> origin/gh/karthickai/24/orig 2025-12-04T09:17:11.3007255Z * [new branch] gh/karthickai/25/base -> origin/gh/karthickai/25/base 2025-12-04T09:17:11.3009442Z * [new branch] gh/karthickai/25/head -> origin/gh/karthickai/25/head 
2025-12-04T09:17:11.3011505Z * [new branch] gh/karthickai/25/orig -> origin/gh/karthickai/25/orig 2025-12-04T09:17:11.3014204Z * [new branch] gh/karthickai/26/base -> origin/gh/karthickai/26/base 2025-12-04T09:17:11.3016684Z * [new branch] gh/karthickai/26/head -> origin/gh/karthickai/26/head 2025-12-04T09:17:11.3018725Z * [new branch] gh/karthickai/26/orig -> origin/gh/karthickai/26/orig 2025-12-04T09:17:11.3023427Z * [new branch] gh/karthickai/6/base -> origin/gh/karthickai/6/base 2025-12-04T09:17:11.3026048Z * [new branch] gh/karthickai/6/head -> origin/gh/karthickai/6/head 2025-12-04T09:17:11.3028220Z * [new branch] gh/karthickai/6/orig -> origin/gh/karthickai/6/orig 2025-12-04T09:17:11.3031530Z * [new branch] gh/krocki/1/base -> origin/gh/krocki/1/base 2025-12-04T09:17:11.3033561Z * [new branch] gh/krocki/1/head -> origin/gh/krocki/1/head 2025-12-04T09:17:11.3035608Z * [new branch] gh/krocki/1/orig -> origin/gh/krocki/1/orig 2025-12-04T09:17:11.3038433Z * [new branch] gh/krocki/2/base -> origin/gh/krocki/2/base 2025-12-04T09:17:11.3040856Z * [new branch] gh/krocki/2/head -> origin/gh/krocki/2/head 2025-12-04T09:17:11.3042894Z * [new branch] gh/krocki/2/orig -> origin/gh/krocki/2/orig 2025-12-04T09:17:11.3046082Z * [new branch] gh/kurtamohler/60/base -> origin/gh/kurtamohler/60/base 2025-12-04T09:17:11.3048064Z * [new branch] gh/kurtamohler/60/head -> origin/gh/kurtamohler/60/head 2025-12-04T09:17:11.3050123Z * [new branch] gh/kurtamohler/60/orig -> origin/gh/kurtamohler/60/orig 2025-12-04T09:17:11.3052932Z * [new branch] gh/kurtamohler/61/base -> origin/gh/kurtamohler/61/base 2025-12-04T09:17:11.3054993Z * [new branch] gh/kurtamohler/61/head -> origin/gh/kurtamohler/61/head 2025-12-04T09:17:11.3057020Z * [new branch] gh/kurtamohler/61/orig -> origin/gh/kurtamohler/61/orig 2025-12-04T09:17:11.3059752Z * [new branch] gh/kurtamohler/62/base -> origin/gh/kurtamohler/62/base 2025-12-04T09:17:11.3061788Z * [new branch] gh/kurtamohler/62/head -> 
origin/gh/kurtamohler/62/head 2025-12-04T09:17:11.3063922Z * [new branch] gh/kurtamohler/62/orig -> origin/gh/kurtamohler/62/orig 2025-12-04T09:17:11.3066943Z * [new branch] gh/kurtamohler/63/base -> origin/gh/kurtamohler/63/base 2025-12-04T09:17:11.3069009Z * [new branch] gh/kurtamohler/63/head -> origin/gh/kurtamohler/63/head 2025-12-04T09:17:11.3071041Z * [new branch] gh/kurtamohler/63/orig -> origin/gh/kurtamohler/63/orig 2025-12-04T09:17:11.3073887Z * [new branch] gh/kurtamohler/64/base -> origin/gh/kurtamohler/64/base 2025-12-04T09:17:11.3075987Z * [new branch] gh/kurtamohler/64/head -> origin/gh/kurtamohler/64/head 2025-12-04T09:17:11.3078086Z * [new branch] gh/kurtamohler/64/orig -> origin/gh/kurtamohler/64/orig 2025-12-04T09:17:11.3081398Z * [new branch] gh/kurtamohler/65/base -> origin/gh/kurtamohler/65/base 2025-12-04T09:17:11.3083288Z * [new branch] gh/kurtamohler/65/head -> origin/gh/kurtamohler/65/head 2025-12-04T09:17:11.3085295Z * [new branch] gh/kurtamohler/65/orig -> origin/gh/kurtamohler/65/orig 2025-12-04T09:17:11.3088047Z * [new branch] gh/kurtamohler/66/base -> origin/gh/kurtamohler/66/base 2025-12-04T09:17:11.3090222Z * [new branch] gh/kurtamohler/66/head -> origin/gh/kurtamohler/66/head 2025-12-04T09:17:11.3092905Z * [new branch] gh/kurtamohler/66/orig -> origin/gh/kurtamohler/66/orig 2025-12-04T09:17:11.3095601Z * [new branch] gh/kurtamohler/67/base -> origin/gh/kurtamohler/67/base 2025-12-04T09:17:11.3097654Z * [new branch] gh/kurtamohler/67/head -> origin/gh/kurtamohler/67/head 2025-12-04T09:17:11.3099790Z * [new branch] gh/kurtamohler/67/orig -> origin/gh/kurtamohler/67/orig 2025-12-04T09:17:11.3103395Z * [new branch] gh/kwen2501/130/base -> origin/gh/kwen2501/130/base 2025-12-04T09:17:11.3105525Z * [new branch] gh/kwen2501/130/head -> origin/gh/kwen2501/130/head 2025-12-04T09:17:11.3107683Z * [new branch] gh/kwen2501/130/orig -> origin/gh/kwen2501/130/orig 2025-12-04T09:17:11.3110643Z * [new branch] gh/kwen2501/170/base -> 
origin/gh/kwen2501/170/base 2025-12-04T09:17:11.3112760Z * [new branch] gh/kwen2501/170/head -> origin/gh/kwen2501/170/head 2025-12-04T09:17:11.3115541Z * [new branch] gh/kwen2501/187/base -> origin/gh/kwen2501/187/base 2025-12-04T09:17:11.3117707Z * [new branch] gh/kwen2501/187/head -> origin/gh/kwen2501/187/head 2025-12-04T09:17:11.3119817Z * [new branch] gh/kwen2501/187/orig -> origin/gh/kwen2501/187/orig 2025-12-04T09:17:11.3122709Z * [new branch] gh/kwen2501/188/base -> origin/gh/kwen2501/188/base 2025-12-04T09:17:11.3124666Z * [new branch] gh/kwen2501/188/head -> origin/gh/kwen2501/188/head 2025-12-04T09:17:11.3126703Z * [new branch] gh/kwen2501/188/orig -> origin/gh/kwen2501/188/orig 2025-12-04T09:17:11.3129460Z * [new branch] gh/kwen2501/211/base -> origin/gh/kwen2501/211/base 2025-12-04T09:17:11.3131521Z * [new branch] gh/kwen2501/211/head -> origin/gh/kwen2501/211/head 2025-12-04T09:17:11.3134208Z * [new branch] gh/kwen2501/224/base -> origin/gh/kwen2501/224/base 2025-12-04T09:17:11.3136309Z * [new branch] gh/kwen2501/224/head -> origin/gh/kwen2501/224/head 2025-12-04T09:17:11.3138415Z * [new branch] gh/kwen2501/224/orig -> origin/gh/kwen2501/224/orig 2025-12-04T09:17:11.3141190Z * [new branch] gh/kwen2501/228/base -> origin/gh/kwen2501/228/base 2025-12-04T09:17:11.3143209Z * [new branch] gh/kwen2501/228/head -> origin/gh/kwen2501/228/head 2025-12-04T09:17:11.3145269Z * [new branch] gh/kwen2501/228/orig -> origin/gh/kwen2501/228/orig 2025-12-04T09:17:11.3148180Z * [new branch] gh/kwen2501/234/base -> origin/gh/kwen2501/234/base 2025-12-04T09:17:11.3150240Z * [new branch] gh/kwen2501/234/head -> origin/gh/kwen2501/234/head 2025-12-04T09:17:11.3152280Z * [new branch] gh/kwen2501/234/orig -> origin/gh/kwen2501/234/orig 2025-12-04T09:17:11.3155041Z * [new branch] gh/kwen2501/235/base -> origin/gh/kwen2501/235/base 2025-12-04T09:17:11.3157111Z * [new branch] gh/kwen2501/235/head -> origin/gh/kwen2501/235/head 2025-12-04T09:17:11.3159222Z * [new branch] 
gh/kwen2501/235/orig -> origin/gh/kwen2501/235/orig 2025-12-04T09:17:11.3162131Z * [new branch] gh/kwen2501/236/base -> origin/gh/kwen2501/236/base 2025-12-04T09:17:11.3164366Z * [new branch] gh/kwen2501/236/head -> origin/gh/kwen2501/236/head 2025-12-04T09:17:11.3166784Z * [new branch] gh/kwen2501/236/orig -> origin/gh/kwen2501/236/orig 2025-12-04T09:17:11.3169735Z * [new branch] gh/kwen2501/237/base -> origin/gh/kwen2501/237/base 2025-12-04T09:17:11.3171680Z * [new branch] gh/kwen2501/237/head -> origin/gh/kwen2501/237/head 2025-12-04T09:17:11.3173816Z * [new branch] gh/kwen2501/237/orig -> origin/gh/kwen2501/237/orig 2025-12-04T09:17:11.3176587Z * [new branch] gh/kwen2501/238/base -> origin/gh/kwen2501/238/base 2025-12-04T09:17:11.3178652Z * [new branch] gh/kwen2501/238/head -> origin/gh/kwen2501/238/head 2025-12-04T09:17:11.3180857Z * [new branch] gh/kwen2501/238/orig -> origin/gh/kwen2501/238/orig 2025-12-04T09:17:11.3184404Z * [new branch] gh/kwen2501/240/base -> origin/gh/kwen2501/240/base 2025-12-04T09:17:11.3185768Z * [new branch] gh/kwen2501/240/head -> origin/gh/kwen2501/240/head 2025-12-04T09:17:11.3187795Z * [new branch] gh/kwen2501/240/orig -> origin/gh/kwen2501/240/orig 2025-12-04T09:17:11.3190474Z * [new branch] gh/kwen2501/241/base -> origin/gh/kwen2501/241/base 2025-12-04T09:17:11.3192643Z * [new branch] gh/kwen2501/241/head -> origin/gh/kwen2501/241/head 2025-12-04T09:17:11.3194687Z * [new branch] gh/kwen2501/241/orig -> origin/gh/kwen2501/241/orig 2025-12-04T09:17:11.3197407Z * [new branch] gh/kwen2501/247/base -> origin/gh/kwen2501/247/base 2025-12-04T09:17:11.3199503Z * [new branch] gh/kwen2501/247/head -> origin/gh/kwen2501/247/head 2025-12-04T09:17:11.3201923Z * [new branch] gh/kwen2501/247/orig -> origin/gh/kwen2501/247/orig 2025-12-04T09:17:11.3204858Z * [new branch] gh/kwen2501/252/base -> origin/gh/kwen2501/252/base 2025-12-04T09:17:11.3206731Z * [new branch] gh/kwen2501/252/head -> origin/gh/kwen2501/252/head 
2025-12-04T09:17:11.3208806Z * [new branch] gh/kwen2501/252/orig -> origin/gh/kwen2501/252/orig 2025-12-04T09:17:11.3212183Z * [new branch] gh/kwen2501/259/base -> origin/gh/kwen2501/259/base 2025-12-04T09:17:11.3214374Z * [new branch] gh/kwen2501/259/head -> origin/gh/kwen2501/259/head 2025-12-04T09:17:11.3216484Z * [new branch] gh/kwen2501/259/orig -> origin/gh/kwen2501/259/orig 2025-12-04T09:17:11.3219337Z * [new branch] gh/kwen2501/260/base -> origin/gh/kwen2501/260/base 2025-12-04T09:17:11.3221591Z * [new branch] gh/kwen2501/260/head -> origin/gh/kwen2501/260/head 2025-12-04T09:17:11.3223771Z * [new branch] gh/kwen2501/260/orig -> origin/gh/kwen2501/260/orig 2025-12-04T09:17:11.3226570Z * [new branch] gh/kwen2501/268/base -> origin/gh/kwen2501/268/base 2025-12-04T09:17:11.3228593Z * [new branch] gh/kwen2501/268/head -> origin/gh/kwen2501/268/head 2025-12-04T09:17:11.3230641Z * [new branch] gh/kwen2501/268/orig -> origin/gh/kwen2501/268/orig 2025-12-04T09:17:11.3233461Z * [new branch] gh/kwen2501/269/base -> origin/gh/kwen2501/269/base 2025-12-04T09:17:11.3241389Z * [new branch] gh/kwen2501/269/head -> origin/gh/kwen2501/269/head 2025-12-04T09:17:11.3241816Z * [new branch] gh/kwen2501/269/orig -> origin/gh/kwen2501/269/orig 2025-12-04T09:17:11.3242048Z * [new branch] gh/kwen2501/270/base -> origin/gh/kwen2501/270/base 2025-12-04T09:17:11.3242259Z * [new branch] gh/kwen2501/270/head -> origin/gh/kwen2501/270/head 2025-12-04T09:17:11.3244562Z * [new branch] gh/kwen2501/270/orig -> origin/gh/kwen2501/270/orig 2025-12-04T09:17:11.3247582Z * [new branch] gh/kwen2501/271/base -> origin/gh/kwen2501/271/base 2025-12-04T09:17:11.3249601Z * [new branch] gh/kwen2501/271/head -> origin/gh/kwen2501/271/head 2025-12-04T09:17:11.3251641Z * [new branch] gh/kwen2501/271/orig -> origin/gh/kwen2501/271/orig 2025-12-04T09:17:11.3254562Z * [new branch] gh/kwen2501/274/base -> origin/gh/kwen2501/274/base 2025-12-04T09:17:11.3256679Z * [new branch] gh/kwen2501/274/head -> 
origin/gh/kwen2501/274/head 2025-12-04T09:17:11.3258719Z * [new branch] gh/kwen2501/274/orig -> origin/gh/kwen2501/274/orig 2025-12-04T09:17:11.3261693Z * [new branch] gh/kwen2501/275/base -> origin/gh/kwen2501/275/base 2025-12-04T09:17:11.3263901Z * [new branch] gh/kwen2501/275/head -> origin/gh/kwen2501/275/head 2025-12-04T09:17:11.3266154Z * [new branch] gh/kwen2501/275/orig -> origin/gh/kwen2501/275/orig 2025-12-04T09:17:11.3268940Z * [new branch] gh/kwen2501/276/base -> origin/gh/kwen2501/276/base 2025-12-04T09:17:11.3270948Z * [new branch] gh/kwen2501/276/head -> origin/gh/kwen2501/276/head 2025-12-04T09:17:11.3272939Z * [new branch] gh/kwen2501/276/orig -> origin/gh/kwen2501/276/orig 2025-12-04T09:17:11.3275776Z * [new branch] gh/kwen2501/277/base -> origin/gh/kwen2501/277/base 2025-12-04T09:17:11.3277995Z * [new branch] gh/kwen2501/277/head -> origin/gh/kwen2501/277/head 2025-12-04T09:17:11.3280055Z * [new branch] gh/kwen2501/277/orig -> origin/gh/kwen2501/277/orig 2025-12-04T09:17:11.3282879Z * [new branch] gh/kwen2501/278/base -> origin/gh/kwen2501/278/base 2025-12-04T09:17:11.3284933Z * [new branch] gh/kwen2501/278/head -> origin/gh/kwen2501/278/head 2025-12-04T09:17:11.3287069Z * [new branch] gh/kwen2501/278/orig -> origin/gh/kwen2501/278/orig 2025-12-04T09:17:11.3289900Z * [new branch] gh/kwen2501/279/base -> origin/gh/kwen2501/279/base 2025-12-04T09:17:11.3292084Z * [new branch] gh/kwen2501/279/head -> origin/gh/kwen2501/279/head 2025-12-04T09:17:11.3294118Z * [new branch] gh/kwen2501/279/orig -> origin/gh/kwen2501/279/orig 2025-12-04T09:17:11.3296968Z * [new branch] gh/kwen2501/280/base -> origin/gh/kwen2501/280/base 2025-12-04T09:17:11.3298982Z * [new branch] gh/kwen2501/280/head -> origin/gh/kwen2501/280/head 2025-12-04T09:17:11.3301366Z * [new branch] gh/kwen2501/280/orig -> origin/gh/kwen2501/280/orig 2025-12-04T09:17:11.3304224Z * [new branch] gh/kwen2501/281/base -> origin/gh/kwen2501/281/base 2025-12-04T09:17:11.3306395Z * [new branch] 
gh/kwen2501/281/head -> origin/gh/kwen2501/281/head 2025-12-04T09:17:11.3308484Z * [new branch] gh/kwen2501/281/orig -> origin/gh/kwen2501/281/orig 2025-12-04T09:17:11.3311272Z * [new branch] gh/kwen2501/282/base -> origin/gh/kwen2501/282/base 2025-12-04T09:17:11.3313374Z * [new branch] gh/kwen2501/282/head -> origin/gh/kwen2501/282/head 2025-12-04T09:17:11.3315493Z * [new branch] gh/kwen2501/282/orig -> origin/gh/kwen2501/282/orig 2025-12-04T09:17:11.3318434Z * [new branch] gh/kwen2501/283/base -> origin/gh/kwen2501/283/base 2025-12-04T09:17:11.3320702Z * [new branch] gh/kwen2501/283/head -> origin/gh/kwen2501/283/head 2025-12-04T09:17:11.3322718Z * [new branch] gh/kwen2501/283/orig -> origin/gh/kwen2501/283/orig 2025-12-04T09:17:11.3325684Z * [new branch] gh/kwen2501/284/base -> origin/gh/kwen2501/284/base 2025-12-04T09:17:11.3327715Z * [new branch] gh/kwen2501/284/head -> origin/gh/kwen2501/284/head 2025-12-04T09:17:11.3329760Z * [new branch] gh/kwen2501/284/orig -> origin/gh/kwen2501/284/orig 2025-12-04T09:17:11.3332537Z * [new branch] gh/kwen2501/285/base -> origin/gh/kwen2501/285/base 2025-12-04T09:17:11.3334820Z * [new branch] gh/kwen2501/285/head -> origin/gh/kwen2501/285/head 2025-12-04T09:17:11.3336755Z * [new branch] gh/kwen2501/285/orig -> origin/gh/kwen2501/285/orig 2025-12-04T09:17:11.3339584Z * [new branch] gh/kwen2501/286/base -> origin/gh/kwen2501/286/base 2025-12-04T09:17:11.3341644Z * [new branch] gh/kwen2501/286/head -> origin/gh/kwen2501/286/head 2025-12-04T09:17:11.3343695Z * [new branch] gh/kwen2501/286/orig -> origin/gh/kwen2501/286/orig 2025-12-04T09:17:11.3346437Z * [new branch] gh/kwen2501/287/base -> origin/gh/kwen2501/287/base 2025-12-04T09:17:11.3348743Z * [new branch] gh/kwen2501/287/head -> origin/gh/kwen2501/287/head 2025-12-04T09:17:11.3350700Z * [new branch] gh/kwen2501/287/orig -> origin/gh/kwen2501/287/orig 2025-12-04T09:17:11.3353580Z * [new branch] gh/kwen2501/288/base -> origin/gh/kwen2501/288/base 
2025-12-04T09:17:11.3355760Z * [new branch] gh/kwen2501/288/head -> origin/gh/kwen2501/288/head 2025-12-04T09:17:11.3358409Z * [new branch] gh/kwen2501/288/orig -> origin/gh/kwen2501/288/orig 2025-12-04T09:17:11.3362475Z * [new branch] gh/laithsakka/251/base -> origin/gh/laithsakka/251/base 2025-12-04T09:17:11.3364490Z * [new branch] gh/laithsakka/251/head -> origin/gh/laithsakka/251/head 2025-12-04T09:17:11.3366524Z * [new branch] gh/laithsakka/251/orig -> origin/gh/laithsakka/251/orig 2025-12-04T09:17:11.3369302Z * [new branch] gh/laithsakka/276/base -> origin/gh/laithsakka/276/base 2025-12-04T09:17:11.3371364Z * [new branch] gh/laithsakka/276/head -> origin/gh/laithsakka/276/head 2025-12-04T09:17:11.3373404Z * [new branch] gh/laithsakka/276/orig -> origin/gh/laithsakka/276/orig 2025-12-04T09:17:11.3376182Z * [new branch] gh/laithsakka/28/base -> origin/gh/laithsakka/28/base 2025-12-04T09:17:11.3378766Z * [new branch] gh/laithsakka/29/base -> origin/gh/laithsakka/29/base 2025-12-04T09:17:11.3381342Z * [new branch] gh/laithsakka/30/base -> origin/gh/laithsakka/30/base 2025-12-04T09:17:11.3383382Z * [new branch] gh/laithsakka/30/head -> origin/gh/laithsakka/30/head 2025-12-04T09:17:11.3386025Z * [new branch] gh/laithsakka/31/base -> origin/gh/laithsakka/31/base 2025-12-04T09:17:11.3388088Z * [new branch] gh/laithsakka/31/head -> origin/gh/laithsakka/31/head 2025-12-04T09:17:11.3390892Z * [new branch] gh/laithsakka/313/base -> origin/gh/laithsakka/313/base 2025-12-04T09:17:11.3392886Z * [new branch] gh/laithsakka/313/head -> origin/gh/laithsakka/313/head 2025-12-04T09:17:11.3394931Z * [new branch] gh/laithsakka/313/orig -> origin/gh/laithsakka/313/orig 2025-12-04T09:17:11.3397909Z * [new branch] gh/laithsakka/316/base -> origin/gh/laithsakka/316/base 2025-12-04T09:17:11.3399845Z * [new branch] gh/laithsakka/316/head -> origin/gh/laithsakka/316/head 2025-12-04T09:17:11.3404704Z * [new branch] gh/laithsakka/316/orig -> origin/gh/laithsakka/316/orig 
2025-12-04T09:17:11.3407436Z * [new branch] gh/laithsakka/317/base -> origin/gh/laithsakka/317/base 2025-12-04T09:17:11.3409409Z * [new branch] gh/laithsakka/317/head -> origin/gh/laithsakka/317/head 2025-12-04T09:17:11.3411526Z * [new branch] gh/laithsakka/317/orig -> origin/gh/laithsakka/317/orig 2025-12-04T09:17:11.3414202Z * [new branch] gh/laithsakka/319/base -> origin/gh/laithsakka/319/base 2025-12-04T09:17:11.3416244Z * [new branch] gh/laithsakka/319/head -> origin/gh/laithsakka/319/head 2025-12-04T09:17:11.3418424Z * [new branch] gh/laithsakka/319/orig -> origin/gh/laithsakka/319/orig 2025-12-04T09:17:11.3421009Z * [new branch] gh/laithsakka/32/base -> origin/gh/laithsakka/32/base 2025-12-04T09:17:11.3422990Z * [new branch] gh/laithsakka/32/head -> origin/gh/laithsakka/32/head 2025-12-04T09:17:11.3425807Z * [new branch] gh/laithsakka/320/base -> origin/gh/laithsakka/320/base 2025-12-04T09:17:11.3427878Z * [new branch] gh/laithsakka/320/head -> origin/gh/laithsakka/320/head 2025-12-04T09:17:11.3429771Z * [new branch] gh/laithsakka/320/orig -> origin/gh/laithsakka/320/orig 2025-12-04T09:17:11.3432815Z * [new branch] gh/laithsakka/321/base -> origin/gh/laithsakka/321/base 2025-12-04T09:17:11.3434984Z * [new branch] gh/laithsakka/321/head -> origin/gh/laithsakka/321/head 2025-12-04T09:17:11.3436956Z * [new branch] gh/laithsakka/321/orig -> origin/gh/laithsakka/321/orig 2025-12-04T09:17:11.3440005Z * [new branch] gh/laithsakka/322/base -> origin/gh/laithsakka/322/base 2025-12-04T09:17:11.3442216Z * [new branch] gh/laithsakka/322/head -> origin/gh/laithsakka/322/head 2025-12-04T09:17:11.3444231Z * [new branch] gh/laithsakka/322/orig -> origin/gh/laithsakka/322/orig 2025-12-04T09:17:11.3447660Z * [new branch] gh/laithsakka/323/base -> origin/gh/laithsakka/323/base 2025-12-04T09:17:11.3449889Z * [new branch] gh/laithsakka/323/head -> origin/gh/laithsakka/323/head 2025-12-04T09:17:11.3451976Z * [new branch] gh/laithsakka/323/orig -> origin/gh/laithsakka/323/orig 
2025-12-04T09:17:11.3454794Z * [new branch] gh/laithsakka/324/base -> origin/gh/laithsakka/324/base 2025-12-04T09:17:11.3456793Z * [new branch] gh/laithsakka/324/head -> origin/gh/laithsakka/324/head 2025-12-04T09:17:11.3458821Z * [new branch] gh/laithsakka/324/orig -> origin/gh/laithsakka/324/orig 2025-12-04T09:17:11.3461725Z * [new branch] gh/laithsakka/325/base -> origin/gh/laithsakka/325/base 2025-12-04T09:17:11.3463772Z * [new branch] gh/laithsakka/325/head -> origin/gh/laithsakka/325/head 2025-12-04T09:17:11.3465889Z * [new branch] gh/laithsakka/325/orig -> origin/gh/laithsakka/325/orig 2025-12-04T09:17:11.3468979Z * [new branch] gh/laithsakka/326/base -> origin/gh/laithsakka/326/base 2025-12-04T09:17:11.3471022Z * [new branch] gh/laithsakka/326/head -> origin/gh/laithsakka/326/head 2025-12-04T09:17:11.3473071Z * [new branch] gh/laithsakka/326/orig -> origin/gh/laithsakka/326/orig 2025-12-04T09:17:11.3475998Z * [new branch] gh/laithsakka/327/base -> origin/gh/laithsakka/327/base 2025-12-04T09:17:11.3478134Z * [new branch] gh/laithsakka/327/head -> origin/gh/laithsakka/327/head 2025-12-04T09:17:11.3480331Z * [new branch] gh/laithsakka/327/orig -> origin/gh/laithsakka/327/orig 2025-12-04T09:17:11.3483192Z * [new branch] gh/laithsakka/328/base -> origin/gh/laithsakka/328/base 2025-12-04T09:17:11.3485202Z * [new branch] gh/laithsakka/328/head -> origin/gh/laithsakka/328/head 2025-12-04T09:17:11.3487233Z * [new branch] gh/laithsakka/328/orig -> origin/gh/laithsakka/328/orig 2025-12-04T09:17:11.3490467Z * [new branch] gh/liangel/4/base -> origin/gh/liangel/4/base 2025-12-04T09:17:11.3492514Z * [new branch] gh/liangel/4/head -> origin/gh/liangel/4/head 2025-12-04T09:17:11.3494575Z * [new branch] gh/liangel/4/orig -> origin/gh/liangel/4/orig 2025-12-04T09:17:11.3499618Z * [new branch] gh/lucaskabela/1/base -> origin/gh/lucaskabela/1/base 2025-12-04T09:17:11.3502018Z * [new branch] gh/lucaskabela/1/head -> origin/gh/lucaskabela/1/head 2025-12-04T09:17:11.3505266Z * 
[new branch] gh/lw/4/base -> origin/gh/lw/4/base 2025-12-04T09:17:11.3507275Z * [new branch] gh/lw/4/head -> origin/gh/lw/4/head 2025-12-04T09:17:11.3509796Z * [new branch] gh/lw/4/orig -> origin/gh/lw/4/orig 2025-12-04T09:17:11.3512513Z * [new branch] gh/lw/5/base -> origin/gh/lw/5/base 2025-12-04T09:17:11.3514542Z * [new branch] gh/lw/5/head -> origin/gh/lw/5/head 2025-12-04T09:17:11.3517079Z * [new branch] gh/lw/5/orig -> origin/gh/lw/5/orig 2025-12-04T09:17:11.3519811Z * [new branch] gh/lw/6/base -> origin/gh/lw/6/base 2025-12-04T09:17:11.3522071Z * [new branch] gh/lw/6/head -> origin/gh/lw/6/head 2025-12-04T09:17:11.3523959Z * [new branch] gh/lw/6/orig -> origin/gh/lw/6/orig 2025-12-04T09:17:11.3527321Z * [new branch] gh/malfet/14/base -> origin/gh/malfet/14/base 2025-12-04T09:17:11.3530060Z * [new branch] gh/malfet/417/base -> origin/gh/malfet/417/base 2025-12-04T09:17:11.3532085Z * [new branch] gh/malfet/417/head -> origin/gh/malfet/417/head 2025-12-04T09:17:11.3534115Z * [new branch] gh/malfet/417/orig -> origin/gh/malfet/417/orig 2025-12-04T09:17:11.3537309Z * [new branch] gh/malfet/506/base -> origin/gh/malfet/506/base 2025-12-04T09:17:11.3539371Z * [new branch] gh/malfet/506/head -> origin/gh/malfet/506/head 2025-12-04T09:17:11.3541396Z * [new branch] gh/malfet/506/orig -> origin/gh/malfet/506/orig 2025-12-04T09:17:11.3544158Z * [new branch] gh/malfet/517/base -> origin/gh/malfet/517/base 2025-12-04T09:17:11.3546238Z * [new branch] gh/malfet/517/head -> origin/gh/malfet/517/head 2025-12-04T09:17:11.3548903Z * [new branch] gh/malfet/528/base -> origin/gh/malfet/528/base 2025-12-04T09:17:11.3550919Z * [new branch] gh/malfet/528/head -> origin/gh/malfet/528/head 2025-12-04T09:17:11.3552972Z * [new branch] gh/malfet/528/orig -> origin/gh/malfet/528/orig 2025-12-04T09:17:11.3555865Z * [new branch] gh/malfet/537/base -> origin/gh/malfet/537/base 2025-12-04T09:17:11.3558322Z * [new branch] gh/malfet/537/head -> origin/gh/malfet/537/head 
2025-12-04T09:17:11.3561043Z * [new branch] gh/malfet/537/orig -> origin/gh/malfet/537/orig 2025-12-04T09:17:11.3563698Z * [new branch] gh/malfet/546/base -> origin/gh/malfet/546/base 2025-12-04T09:17:11.3565515Z * [new branch] gh/malfet/546/head -> origin/gh/malfet/546/head 2025-12-04T09:17:11.3567488Z * [new branch] gh/malfet/546/orig -> origin/gh/malfet/546/orig 2025-12-04T09:17:11.3570210Z * [new branch] gh/malfet/565/base -> origin/gh/malfet/565/base 2025-12-04T09:17:11.3572250Z * [new branch] gh/malfet/565/head -> origin/gh/malfet/565/head 2025-12-04T09:17:11.3574324Z * [new branch] gh/malfet/565/orig -> origin/gh/malfet/565/orig 2025-12-04T09:17:11.3577043Z * [new branch] gh/malfet/575/base -> origin/gh/malfet/575/base 2025-12-04T09:17:11.3579082Z * [new branch] gh/malfet/575/head -> origin/gh/malfet/575/head 2025-12-04T09:17:11.3581128Z * [new branch] gh/malfet/575/orig -> origin/gh/malfet/575/orig 2025-12-04T09:17:11.3584021Z * [new branch] gh/malfet/580/base -> origin/gh/malfet/580/base 2025-12-04T09:17:11.3586025Z * [new branch] gh/malfet/580/head -> origin/gh/malfet/580/head 2025-12-04T09:17:11.3588109Z * [new branch] gh/malfet/580/orig -> origin/gh/malfet/580/orig 2025-12-04T09:17:11.3590797Z * [new branch] gh/malfet/581/base -> origin/gh/malfet/581/base 2025-12-04T09:17:11.3592862Z * [new branch] gh/malfet/581/head -> origin/gh/malfet/581/head 2025-12-04T09:17:11.3594900Z * [new branch] gh/malfet/581/orig -> origin/gh/malfet/581/orig 2025-12-04T09:17:11.3597587Z * [new branch] gh/malfet/583/base -> origin/gh/malfet/583/base 2025-12-04T09:17:11.3599715Z * [new branch] gh/malfet/583/head -> origin/gh/malfet/583/head 2025-12-04T09:17:11.3602161Z * [new branch] gh/malfet/583/orig -> origin/gh/malfet/583/orig 2025-12-04T09:17:11.3605322Z * [new branch] gh/malfet/586/base -> origin/gh/malfet/586/base 2025-12-04T09:17:11.3607513Z * [new branch] gh/malfet/586/head -> origin/gh/malfet/586/head 2025-12-04T09:17:11.3609405Z * [new branch] gh/malfet/586/orig -> 
origin/gh/malfet/586/orig 2025-12-04T09:17:11.3612179Z * [new branch] gh/malfet/587/base -> origin/gh/malfet/587/base 2025-12-04T09:17:11.3614256Z * [new branch] gh/malfet/587/head -> origin/gh/malfet/587/head 2025-12-04T09:17:11.3616266Z * [new branch] gh/malfet/587/orig -> origin/gh/malfet/587/orig 2025-12-04T09:17:11.3618960Z * [new branch] gh/malfet/588/base -> origin/gh/malfet/588/base 2025-12-04T09:17:11.3620987Z * [new branch] gh/malfet/588/head -> origin/gh/malfet/588/head 2025-12-04T09:17:11.3623193Z * [new branch] gh/malfet/588/orig -> origin/gh/malfet/588/orig 2025-12-04T09:17:11.3625942Z * [new branch] gh/malfet/589/base -> origin/gh/malfet/589/base 2025-12-04T09:17:11.3628106Z * [new branch] gh/malfet/589/head -> origin/gh/malfet/589/head 2025-12-04T09:17:11.3630136Z * [new branch] gh/malfet/589/orig -> origin/gh/malfet/589/orig 2025-12-04T09:17:11.3632838Z * [new branch] gh/malfet/590/base -> origin/gh/malfet/590/base 2025-12-04T09:17:11.3634851Z * [new branch] gh/malfet/590/head -> origin/gh/malfet/590/head 2025-12-04T09:17:11.3636911Z * [new branch] gh/malfet/590/orig -> origin/gh/malfet/590/orig 2025-12-04T09:17:11.3640429Z * [new branch] gh/malfet/591/base -> origin/gh/malfet/591/base 2025-12-04T09:17:11.3642520Z * [new branch] gh/malfet/591/head -> origin/gh/malfet/591/head 2025-12-04T09:17:11.3644554Z * [new branch] gh/malfet/591/orig -> origin/gh/malfet/591/orig 2025-12-04T09:17:11.3647356Z * [new branch] gh/malfet/592/base -> origin/gh/malfet/592/base 2025-12-04T09:17:11.3649385Z * [new branch] gh/malfet/592/head -> origin/gh/malfet/592/head 2025-12-04T09:17:11.3651426Z * [new branch] gh/malfet/592/orig -> origin/gh/malfet/592/orig 2025-12-04T09:17:11.3654184Z * [new branch] gh/malfet/593/base -> origin/gh/malfet/593/base 2025-12-04T09:17:11.3656187Z * [new branch] gh/malfet/593/head -> origin/gh/malfet/593/head 2025-12-04T09:17:11.3658286Z * [new branch] gh/malfet/593/orig -> origin/gh/malfet/593/orig 2025-12-04T09:17:11.3661075Z * [new 
branch] gh/malfet/594/base -> origin/gh/malfet/594/base 2025-12-04T09:17:11.3663413Z * [new branch] gh/malfet/594/head -> origin/gh/malfet/594/head 2025-12-04T09:17:11.3665458Z * [new branch] gh/malfet/594/orig -> origin/gh/malfet/594/orig 2025-12-04T09:17:11.3668247Z * [new branch] gh/malfet/595/base -> origin/gh/malfet/595/base 2025-12-04T09:17:11.3670157Z * [new branch] gh/malfet/595/head -> origin/gh/malfet/595/head 2025-12-04T09:17:11.3672184Z * [new branch] gh/malfet/595/orig -> origin/gh/malfet/595/orig 2025-12-04T09:17:11.3674956Z * [new branch] gh/malfet/596/base -> origin/gh/malfet/596/base 2025-12-04T09:17:11.3676959Z * [new branch] gh/malfet/596/head -> origin/gh/malfet/596/head 2025-12-04T09:17:11.3679140Z * [new branch] gh/malfet/596/orig -> origin/gh/malfet/596/orig 2025-12-04T09:17:11.3682062Z * [new branch] gh/malfet/597/base -> origin/gh/malfet/597/base 2025-12-04T09:17:11.3684063Z * [new branch] gh/malfet/597/head -> origin/gh/malfet/597/head 2025-12-04T09:17:11.3686086Z * [new branch] gh/malfet/597/orig -> origin/gh/malfet/597/orig 2025-12-04T09:17:11.3688870Z * [new branch] gh/malfet/598/base -> origin/gh/malfet/598/base 2025-12-04T09:17:11.3691086Z * [new branch] gh/malfet/598/head -> origin/gh/malfet/598/head 2025-12-04T09:17:11.3693087Z * [new branch] gh/malfet/598/orig -> origin/gh/malfet/598/orig 2025-12-04T09:17:11.3696011Z * [new branch] gh/malfet/599/base -> origin/gh/malfet/599/base 2025-12-04T09:17:11.3698088Z * [new branch] gh/malfet/599/head -> origin/gh/malfet/599/head 2025-12-04T09:17:11.3700189Z * [new branch] gh/malfet/599/orig -> origin/gh/malfet/599/orig 2025-12-04T09:17:11.3704203Z * [new branch] gh/malfet/600/base -> origin/gh/malfet/600/base 2025-12-04T09:17:11.3706182Z * [new branch] gh/malfet/600/head -> origin/gh/malfet/600/head 2025-12-04T09:17:11.3708361Z * [new branch] gh/malfet/600/orig -> origin/gh/malfet/600/orig 2025-12-04T09:17:11.3711321Z * [new branch] gh/malfet/601/base -> origin/gh/malfet/601/base 
2025-12-04T09:17:11.3713613Z * [new branch] gh/malfet/601/head -> origin/gh/malfet/601/head 2025-12-04T09:17:11.3715521Z * [new branch] gh/malfet/601/orig -> origin/gh/malfet/601/orig 2025-12-04T09:17:11.3718377Z * [new branch] gh/malfet/602/base -> origin/gh/malfet/602/base 2025-12-04T09:17:11.3720429Z * [new branch] gh/malfet/602/head -> origin/gh/malfet/602/head 2025-12-04T09:17:11.3722639Z * [new branch] gh/malfet/602/orig -> origin/gh/malfet/602/orig 2025-12-04T09:17:11.3725416Z * [new branch] gh/malfet/603/base -> origin/gh/malfet/603/base 2025-12-04T09:17:11.3727848Z * [new branch] gh/malfet/603/head -> origin/gh/malfet/603/head 2025-12-04T09:17:11.3729651Z * [new branch] gh/malfet/603/orig -> origin/gh/malfet/603/orig 2025-12-04T09:17:11.3732668Z * [new branch] gh/malfet/604/base -> origin/gh/malfet/604/base 2025-12-04T09:17:11.3734533Z * [new branch] gh/malfet/604/head -> origin/gh/malfet/604/head 2025-12-04T09:17:11.3736595Z * [new branch] gh/malfet/604/orig -> origin/gh/malfet/604/orig 2025-12-04T09:17:11.3739449Z * [new branch] gh/malfet/605/base -> origin/gh/malfet/605/base 2025-12-04T09:17:11.3741556Z * [new branch] gh/malfet/605/head -> origin/gh/malfet/605/head 2025-12-04T09:17:11.3743676Z * [new branch] gh/malfet/605/orig -> origin/gh/malfet/605/orig 2025-12-04T09:17:11.3746568Z * [new branch] gh/malfet/606/base -> origin/gh/malfet/606/base 2025-12-04T09:17:11.3748767Z * [new branch] gh/malfet/606/head -> origin/gh/malfet/606/head 2025-12-04T09:17:11.3750896Z * [new branch] gh/malfet/606/orig -> origin/gh/malfet/606/orig 2025-12-04T09:17:11.3753800Z * [new branch] gh/malfet/607/base -> origin/gh/malfet/607/base 2025-12-04T09:17:11.3755824Z * [new branch] gh/malfet/607/head -> origin/gh/malfet/607/head 2025-12-04T09:17:11.3757967Z * [new branch] gh/malfet/607/orig -> origin/gh/malfet/607/orig 2025-12-04T09:17:11.3761013Z * [new branch] gh/malfet/608/base -> origin/gh/malfet/608/base 2025-12-04T09:17:11.3763041Z * [new branch] gh/malfet/608/head -> 
origin/gh/malfet/608/head 2025-12-04T09:17:11.3765033Z * [new branch] gh/malfet/608/orig -> origin/gh/malfet/608/orig 2025-12-04T09:17:11.3767945Z * [new branch] gh/malfet/609/base -> origin/gh/malfet/609/base 2025-12-04T09:17:11.3769933Z * [new branch] gh/malfet/609/head -> origin/gh/malfet/609/head 2025-12-04T09:17:11.3771978Z * [new branch] gh/malfet/609/orig -> origin/gh/malfet/609/orig 2025-12-04T09:17:11.3774954Z * [new branch] gh/malfet/610/base -> origin/gh/malfet/610/base 2025-12-04T09:17:11.3776937Z * [new branch] gh/malfet/610/head -> origin/gh/malfet/610/head 2025-12-04T09:17:11.3779013Z * [new branch] gh/malfet/610/orig -> origin/gh/malfet/610/orig 2025-12-04T09:17:11.3781889Z * [new branch] gh/malfet/611/base -> origin/gh/malfet/611/base 2025-12-04T09:17:11.3783941Z * [new branch] gh/malfet/611/head -> origin/gh/malfet/611/head 2025-12-04T09:17:11.3786031Z * [new branch] gh/malfet/611/orig -> origin/gh/malfet/611/orig 2025-12-04T09:17:11.3789182Z * [new branch] gh/malfet/612/base -> origin/gh/malfet/612/base 2025-12-04T09:17:11.3791297Z * [new branch] gh/malfet/612/head -> origin/gh/malfet/612/head 2025-12-04T09:17:11.3793802Z * [new branch] gh/malfet/612/orig -> origin/gh/malfet/612/orig 2025-12-04T09:17:11.3796719Z * [new branch] gh/malfet/64/base -> origin/gh/malfet/64/base 2025-12-04T09:17:11.3799051Z * [new branch] gh/malfet/64/head -> origin/gh/malfet/64/head 2025-12-04T09:17:11.3802863Z * [new branch] gh/manuelcandales/11/base -> origin/gh/manuelcandales/11/base 2025-12-04T09:17:11.3804908Z * [new branch] gh/manuelcandales/11/head -> origin/gh/manuelcandales/11/head 2025-12-04T09:17:11.3806924Z * [new branch] gh/manuelcandales/11/orig -> origin/gh/manuelcandales/11/orig 2025-12-04T09:17:11.3810443Z * [new branch] gh/markkm/1/base -> origin/gh/markkm/1/base 2025-12-04T09:17:11.3813758Z * [new branch] gh/masnesral/1/base -> origin/gh/masnesral/1/base 2025-12-04T09:17:11.3815746Z * [new branch] gh/masnesral/1/head -> origin/gh/masnesral/1/head 
2025-12-04T09:17:11.3817759Z * [new branch] gh/masnesral/1/orig -> origin/gh/masnesral/1/orig 2025-12-04T09:17:11.3821457Z * [new branch] gh/mhorowitz/0/base -> origin/gh/mhorowitz/0/base 2025-12-04T09:17:11.3823498Z * [new branch] gh/mhorowitz/0/head -> origin/gh/mhorowitz/0/head 2025-12-04T09:17:11.3826027Z * [new branch] gh/mhorowitz/1/base -> origin/gh/mhorowitz/1/base 2025-12-04T09:17:11.3828101Z * [new branch] gh/mhorowitz/1/head -> origin/gh/mhorowitz/1/head 2025-12-04T09:17:11.3830703Z * [new branch] gh/mhorowitz/2/base -> origin/gh/mhorowitz/2/base 2025-12-04T09:17:11.3832725Z * [new branch] gh/mhorowitz/2/head -> origin/gh/mhorowitz/2/head 2025-12-04T09:17:11.3835246Z * [new branch] gh/mhorowitz/3/base -> origin/gh/mhorowitz/3/base 2025-12-04T09:17:11.3837288Z * [new branch] gh/mhorowitz/3/head -> origin/gh/mhorowitz/3/head 2025-12-04T09:17:11.3840008Z * [new branch] gh/mhorowitz/4/base -> origin/gh/mhorowitz/4/base 2025-12-04T09:17:11.3842090Z * [new branch] gh/mhorowitz/4/head -> origin/gh/mhorowitz/4/head 2025-12-04T09:17:11.3844647Z * [new branch] gh/mhorowitz/5/base -> origin/gh/mhorowitz/5/base 2025-12-04T09:17:11.3846559Z * [new branch] gh/mhorowitz/5/head -> origin/gh/mhorowitz/5/head 2025-12-04T09:17:11.3849074Z * [new branch] gh/mhorowitz/6/base -> origin/gh/mhorowitz/6/base 2025-12-04T09:17:11.3851007Z * [new branch] gh/mhorowitz/6/head -> origin/gh/mhorowitz/6/head 2025-12-04T09:17:11.3854401Z * [new branch] gh/mikaylagawarecki/234/base -> origin/gh/mikaylagawarecki/234/base 2025-12-04T09:17:11.3856481Z * [new branch] gh/mikaylagawarecki/234/head -> origin/gh/mikaylagawarecki/234/head 2025-12-04T09:17:11.3859516Z * [new branch] gh/mikaylagawarecki/235/base -> origin/gh/mikaylagawarecki/235/base 2025-12-04T09:17:11.3861742Z * [new branch] gh/mikaylagawarecki/235/head -> origin/gh/mikaylagawarecki/235/head 2025-12-04T09:17:11.3864218Z * [new branch] gh/mikaylagawarecki/236/base -> origin/gh/mikaylagawarecki/236/base 2025-12-04T09:17:11.3866193Z 
* [new branch] gh/mikaylagawarecki/236/head -> origin/gh/mikaylagawarecki/236/head 2025-12-04T09:17:11.3868891Z * [new branch] gh/mikaylagawarecki/237/base -> origin/gh/mikaylagawarecki/237/base 2025-12-04T09:17:11.3870702Z * [new branch] gh/mikaylagawarecki/237/head -> origin/gh/mikaylagawarecki/237/head 2025-12-04T09:17:11.3873403Z * [new branch] gh/mikaylagawarecki/238/base -> origin/gh/mikaylagawarecki/238/base 2025-12-04T09:17:11.3875382Z * [new branch] gh/mikaylagawarecki/238/head -> origin/gh/mikaylagawarecki/238/head 2025-12-04T09:17:11.3878155Z * [new branch] gh/mikaylagawarecki/336/base -> origin/gh/mikaylagawarecki/336/base 2025-12-04T09:17:11.3880416Z * [new branch] gh/mikaylagawarecki/336/head -> origin/gh/mikaylagawarecki/336/head 2025-12-04T09:17:11.3882437Z * [new branch] gh/mikaylagawarecki/336/orig -> origin/gh/mikaylagawarecki/336/orig 2025-12-04T09:17:11.3885295Z * [new branch] gh/mikaylagawarecki/341/base -> origin/gh/mikaylagawarecki/341/base 2025-12-04T09:17:11.3887254Z * [new branch] gh/mikaylagawarecki/341/head -> origin/gh/mikaylagawarecki/341/head 2025-12-04T09:17:11.3889351Z * [new branch] gh/mikaylagawarecki/341/orig -> origin/gh/mikaylagawarecki/341/orig 2025-12-04T09:17:11.3892186Z * [new branch] gh/mikaylagawarecki/342/base -> origin/gh/mikaylagawarecki/342/base 2025-12-04T09:17:11.3894212Z * [new branch] gh/mikaylagawarecki/342/head -> origin/gh/mikaylagawarecki/342/head 2025-12-04T09:17:11.3896230Z * [new branch] gh/mikaylagawarecki/342/orig -> origin/gh/mikaylagawarecki/342/orig 2025-12-04T09:17:11.3899138Z * [new branch] gh/mikaylagawarecki/345/base -> origin/gh/mikaylagawarecki/345/base 2025-12-04T09:17:11.3901363Z * [new branch] gh/mikaylagawarecki/345/head -> origin/gh/mikaylagawarecki/345/head 2025-12-04T09:17:11.3903540Z * [new branch] gh/mikaylagawarecki/345/orig -> origin/gh/mikaylagawarecki/345/orig 2025-12-04T09:17:11.3906364Z * [new branch] gh/mikaylagawarecki/346/base -> origin/gh/mikaylagawarecki/346/base 
2025-12-04T09:17:11.3908613Z * [new branch] gh/mikaylagawarecki/346/head -> origin/gh/mikaylagawarecki/346/head 2025-12-04T09:17:11.3910689Z * [new branch] gh/mikaylagawarecki/346/orig -> origin/gh/mikaylagawarecki/346/orig 2025-12-04T09:17:11.3913466Z * [new branch] gh/mikaylagawarecki/347/base -> origin/gh/mikaylagawarecki/347/base 2025-12-04T09:17:11.3915448Z * [new branch] gh/mikaylagawarecki/347/head -> origin/gh/mikaylagawarecki/347/head 2025-12-04T09:17:11.3917481Z * [new branch] gh/mikaylagawarecki/347/orig -> origin/gh/mikaylagawarecki/347/orig 2025-12-04T09:17:11.3920506Z * [new branch] gh/mikaylagawarecki/350/base -> origin/gh/mikaylagawarecki/350/base 2025-12-04T09:17:11.3922290Z * [new branch] gh/mikaylagawarecki/350/head -> origin/gh/mikaylagawarecki/350/head 2025-12-04T09:17:11.3924485Z * [new branch] gh/mikaylagawarecki/350/orig -> origin/gh/mikaylagawarecki/350/orig 2025-12-04T09:17:11.3927640Z * [new branch] gh/mikaylagawarecki/351/base -> origin/gh/mikaylagawarecki/351/base 2025-12-04T09:17:11.3929761Z * [new branch] gh/mikaylagawarecki/351/head -> origin/gh/mikaylagawarecki/351/head 2025-12-04T09:17:11.3931882Z * [new branch] gh/mikaylagawarecki/351/orig -> origin/gh/mikaylagawarecki/351/orig 2025-12-04T09:17:11.3935311Z * [new branch] gh/mikaylagawarecki/352/base -> origin/gh/mikaylagawarecki/352/base 2025-12-04T09:17:11.3937697Z * [new branch] gh/mikaylagawarecki/352/head -> origin/gh/mikaylagawarecki/352/head 2025-12-04T09:17:11.3939817Z * [new branch] gh/mikaylagawarecki/352/orig -> origin/gh/mikaylagawarecki/352/orig 2025-12-04T09:17:11.3942557Z * [new branch] gh/mikaylagawarecki/353/base -> origin/gh/mikaylagawarecki/353/base 2025-12-04T09:17:11.3944823Z * [new branch] gh/mikaylagawarecki/353/head -> origin/gh/mikaylagawarecki/353/head 2025-12-04T09:17:11.3946948Z * [new branch] gh/mikaylagawarecki/353/orig -> origin/gh/mikaylagawarecki/353/orig 2025-12-04T09:17:11.3949542Z * [new branch] gh/mikaylagawarecki/354/base -> 
origin/gh/mikaylagawarecki/354/base 2025-12-04T09:17:11.3951625Z * [new branch] gh/mikaylagawarecki/354/head -> origin/gh/mikaylagawarecki/354/head 2025-12-04T09:17:11.3953612Z * [new branch] gh/mikaylagawarecki/354/orig -> origin/gh/mikaylagawarecki/354/orig 2025-12-04T09:17:11.3956887Z * [new branch] gh/mikaylagawarecki/356/base -> origin/gh/mikaylagawarecki/356/base 2025-12-04T09:17:11.3958981Z * [new branch] gh/mikaylagawarecki/356/head -> origin/gh/mikaylagawarecki/356/head 2025-12-04T09:17:11.3961281Z * [new branch] gh/mikaylagawarecki/356/orig -> origin/gh/mikaylagawarecki/356/orig 2025-12-04T09:17:11.3963966Z * [new branch] gh/mikaylagawarecki/357/base -> origin/gh/mikaylagawarecki/357/base 2025-12-04T09:17:11.3966214Z * [new branch] gh/mikaylagawarecki/357/head -> origin/gh/mikaylagawarecki/357/head 2025-12-04T09:17:11.3968222Z * [new branch] gh/mikaylagawarecki/357/orig -> origin/gh/mikaylagawarecki/357/orig 2025-12-04T09:17:11.3971152Z * [new branch] gh/mikaylagawarecki/359/base -> origin/gh/mikaylagawarecki/359/base 2025-12-04T09:17:11.3973265Z * [new branch] gh/mikaylagawarecki/359/head -> origin/gh/mikaylagawarecki/359/head 2025-12-04T09:17:11.3975361Z * [new branch] gh/mikaylagawarecki/359/orig -> origin/gh/mikaylagawarecki/359/orig 2025-12-04T09:17:11.3978163Z * [new branch] gh/mikaylagawarecki/360/base -> origin/gh/mikaylagawarecki/360/base 2025-12-04T09:17:11.3980766Z * [new branch] gh/mikaylagawarecki/360/head -> origin/gh/mikaylagawarecki/360/head 2025-12-04T09:17:11.3982799Z * [new branch] gh/mikaylagawarecki/360/orig -> origin/gh/mikaylagawarecki/360/orig 2025-12-04T09:17:11.3985729Z * [new branch] gh/mikaylagawarecki/361/base -> origin/gh/mikaylagawarecki/361/base 2025-12-04T09:17:11.3987742Z * [new branch] gh/mikaylagawarecki/361/head -> origin/gh/mikaylagawarecki/361/head 2025-12-04T09:17:11.3989797Z * [new branch] gh/mikaylagawarecki/361/orig -> origin/gh/mikaylagawarecki/361/orig 2025-12-04T09:17:11.3992657Z * [new branch] 
gh/mikaylagawarecki/362/base -> origin/gh/mikaylagawarecki/362/base 2025-12-04T09:17:11.3995025Z * [new branch] gh/mikaylagawarecki/362/head -> origin/gh/mikaylagawarecki/362/head 2025-12-04T09:17:11.3997043Z * [new branch] gh/mikaylagawarecki/362/orig -> origin/gh/mikaylagawarecki/362/orig 2025-12-04T09:17:11.4000427Z * [new branch] gh/mikaylagawarecki/363/base -> origin/gh/mikaylagawarecki/363/base 2025-12-04T09:17:11.4004993Z * [new branch] gh/mikaylagawarecki/363/head -> origin/gh/mikaylagawarecki/363/head 2025-12-04T09:17:11.4007136Z * [new branch] gh/mikaylagawarecki/363/orig -> origin/gh/mikaylagawarecki/363/orig 2025-12-04T09:17:11.4010399Z * [new branch] gh/mikaylagawarecki/364/base -> origin/gh/mikaylagawarecki/364/base 2025-12-04T09:17:11.4012440Z * [new branch] gh/mikaylagawarecki/364/head -> origin/gh/mikaylagawarecki/364/head 2025-12-04T09:17:11.4014553Z * [new branch] gh/mikaylagawarecki/364/orig -> origin/gh/mikaylagawarecki/364/orig 2025-12-04T09:17:11.4017521Z * [new branch] gh/mikaylagawarecki/365/base -> origin/gh/mikaylagawarecki/365/base 2025-12-04T09:17:11.4020164Z * [new branch] gh/mikaylagawarecki/365/head -> origin/gh/mikaylagawarecki/365/head 2025-12-04T09:17:11.4022183Z * [new branch] gh/mikaylagawarecki/365/orig -> origin/gh/mikaylagawarecki/365/orig 2025-12-04T09:17:11.4025091Z * [new branch] gh/mikaylagawarecki/366/base -> origin/gh/mikaylagawarecki/366/base 2025-12-04T09:17:11.4027229Z * [new branch] gh/mikaylagawarecki/366/head -> origin/gh/mikaylagawarecki/366/head 2025-12-04T09:17:11.4029328Z * [new branch] gh/mikaylagawarecki/366/orig -> origin/gh/mikaylagawarecki/366/orig 2025-12-04T09:17:11.4032184Z * [new branch] gh/mikaylagawarecki/367/base -> origin/gh/mikaylagawarecki/367/base 2025-12-04T09:17:11.4034236Z * [new branch] gh/mikaylagawarecki/367/head -> origin/gh/mikaylagawarecki/367/head 2025-12-04T09:17:11.4036276Z * [new branch] gh/mikaylagawarecki/367/orig -> origin/gh/mikaylagawarecki/367/orig 
2025-12-04T09:17:11.4039281Z * [new branch] gh/mikaylagawarecki/368/base -> origin/gh/mikaylagawarecki/368/base 2025-12-04T09:17:11.4041593Z * [new branch] gh/mikaylagawarecki/368/head -> origin/gh/mikaylagawarecki/368/head 2025-12-04T09:17:11.4043725Z * [new branch] gh/mikaylagawarecki/368/orig -> origin/gh/mikaylagawarecki/368/orig 2025-12-04T09:17:11.4046553Z * [new branch] gh/mikaylagawarecki/369/base -> origin/gh/mikaylagawarecki/369/base 2025-12-04T09:17:11.4048745Z * [new branch] gh/mikaylagawarecki/369/head -> origin/gh/mikaylagawarecki/369/head 2025-12-04T09:17:11.4050750Z * [new branch] gh/mikaylagawarecki/369/orig -> origin/gh/mikaylagawarecki/369/orig 2025-12-04T09:17:11.4053715Z * [new branch] gh/mikaylagawarecki/370/base -> origin/gh/mikaylagawarecki/370/base 2025-12-04T09:17:11.4055902Z * [new branch] gh/mikaylagawarecki/370/head -> origin/gh/mikaylagawarecki/370/head 2025-12-04T09:17:11.4057927Z * [new branch] gh/mikaylagawarecki/370/orig -> origin/gh/mikaylagawarecki/370/orig 2025-12-04T09:17:11.4060820Z * [new branch] gh/mikaylagawarecki/371/base -> origin/gh/mikaylagawarecki/371/base 2025-12-04T09:17:11.4062886Z * [new branch] gh/mikaylagawarecki/371/head -> origin/gh/mikaylagawarecki/371/head 2025-12-04T09:17:11.4064942Z * [new branch] gh/mikaylagawarecki/371/orig -> origin/gh/mikaylagawarecki/371/orig 2025-12-04T09:17:11.4067821Z * [new branch] gh/mikaylagawarecki/372/base -> origin/gh/mikaylagawarecki/372/base 2025-12-04T09:17:11.4069906Z * [new branch] gh/mikaylagawarecki/372/head -> origin/gh/mikaylagawarecki/372/head 2025-12-04T09:17:11.4072022Z * [new branch] gh/mikaylagawarecki/372/orig -> origin/gh/mikaylagawarecki/372/orig 2025-12-04T09:17:11.4074672Z * [new branch] gh/mikaylagawarecki/373/base -> origin/gh/mikaylagawarecki/373/base 2025-12-04T09:17:11.4076717Z * [new branch] gh/mikaylagawarecki/373/head -> origin/gh/mikaylagawarecki/373/head 2025-12-04T09:17:11.4078846Z * [new branch] gh/mikaylagawarecki/373/orig -> 
origin/gh/mikaylagawarecki/373/orig 2025-12-04T09:17:11.4081851Z * [new branch] gh/mikaylagawarecki/374/base -> origin/gh/mikaylagawarecki/374/base 2025-12-04T09:17:11.4083977Z * [new branch] gh/mikaylagawarecki/374/head -> origin/gh/mikaylagawarecki/374/head 2025-12-04T09:17:11.4086533Z * [new branch] gh/mikaylagawarecki/374/orig -> origin/gh/mikaylagawarecki/374/orig 2025-12-04T09:17:11.4089455Z * [new branch] gh/mikaylagawarecki/375/base -> origin/gh/mikaylagawarecki/375/base 2025-12-04T09:17:11.4091553Z * [new branch] gh/mikaylagawarecki/375/head -> origin/gh/mikaylagawarecki/375/head 2025-12-04T09:17:11.4093636Z * [new branch] gh/mikaylagawarecki/375/orig -> origin/gh/mikaylagawarecki/375/orig 2025-12-04T09:17:11.4096452Z * [new branch] gh/mikaylagawarecki/376/base -> origin/gh/mikaylagawarecki/376/base 2025-12-04T09:17:11.4098860Z * [new branch] gh/mikaylagawarecki/376/head -> origin/gh/mikaylagawarecki/376/head 2025-12-04T09:17:11.4100843Z * [new branch] gh/mikaylagawarecki/376/orig -> origin/gh/mikaylagawarecki/376/orig 2025-12-04T09:17:11.4103884Z * [new branch] gh/mikaylagawarecki/377/base -> origin/gh/mikaylagawarecki/377/base 2025-12-04T09:17:11.4106000Z * [new branch] gh/mikaylagawarecki/377/head -> origin/gh/mikaylagawarecki/377/head 2025-12-04T09:17:11.4108054Z * [new branch] gh/mikaylagawarecki/377/orig -> origin/gh/mikaylagawarecki/377/orig 2025-12-04T09:17:11.4110928Z * [new branch] gh/mikaylagawarecki/378/base -> origin/gh/mikaylagawarecki/378/base 2025-12-04T09:17:11.4113125Z * [new branch] gh/mikaylagawarecki/378/head -> origin/gh/mikaylagawarecki/378/head 2025-12-04T09:17:11.4115110Z * [new branch] gh/mikaylagawarecki/378/orig -> origin/gh/mikaylagawarecki/378/orig 2025-12-04T09:17:11.4117996Z * [new branch] gh/mikaylagawarecki/379/base -> origin/gh/mikaylagawarecki/379/base 2025-12-04T09:17:11.4120098Z * [new branch] gh/mikaylagawarecki/379/head -> origin/gh/mikaylagawarecki/379/head 2025-12-04T09:17:11.4122285Z * [new branch] 
gh/mikaylagawarecki/379/orig -> origin/gh/mikaylagawarecki/379/orig 2025-12-04T09:17:11.4124938Z * [new branch] gh/mikaylagawarecki/380/base -> origin/gh/mikaylagawarecki/380/base 2025-12-04T09:17:11.4127066Z * [new branch] gh/mikaylagawarecki/380/head -> origin/gh/mikaylagawarecki/380/head 2025-12-04T09:17:11.4129048Z * [new branch] gh/mikaylagawarecki/380/orig -> origin/gh/mikaylagawarecki/380/orig 2025-12-04T09:17:11.4131731Z * [new branch] gh/mikaylagawarecki/381/base -> origin/gh/mikaylagawarecki/381/base 2025-12-04T09:17:11.4133774Z * [new branch] gh/mikaylagawarecki/381/head -> origin/gh/mikaylagawarecki/381/head 2025-12-04T09:17:11.4135852Z * [new branch] gh/mikaylagawarecki/381/orig -> origin/gh/mikaylagawarecki/381/orig 2025-12-04T09:17:11.4139046Z * [new branch] gh/mikaylagawarecki/382/base -> origin/gh/mikaylagawarecki/382/base 2025-12-04T09:17:11.4141217Z * [new branch] gh/mikaylagawarecki/382/head -> origin/gh/mikaylagawarecki/382/head 2025-12-04T09:17:11.4143284Z * [new branch] gh/mikaylagawarecki/382/orig -> origin/gh/mikaylagawarecki/382/orig 2025-12-04T09:17:11.4146238Z * [new branch] gh/mikaylagawarecki/383/base -> origin/gh/mikaylagawarecki/383/base 2025-12-04T09:17:11.4148407Z * [new branch] gh/mikaylagawarecki/383/head -> origin/gh/mikaylagawarecki/383/head 2025-12-04T09:17:11.4150442Z * [new branch] gh/mikaylagawarecki/383/orig -> origin/gh/mikaylagawarecki/383/orig 2025-12-04T09:17:11.4153786Z * [new branch] gh/mikaylagawarecki/384/base -> origin/gh/mikaylagawarecki/384/base 2025-12-04T09:17:11.4155831Z * [new branch] gh/mikaylagawarecki/384/head -> origin/gh/mikaylagawarecki/384/head 2025-12-04T09:17:11.4157863Z * [new branch] gh/mikaylagawarecki/384/orig -> origin/gh/mikaylagawarecki/384/orig 2025-12-04T09:17:11.4160885Z * [new branch] gh/mikaylagawarecki/385/base -> origin/gh/mikaylagawarecki/385/base 2025-12-04T09:17:11.4163012Z * [new branch] gh/mikaylagawarecki/385/head -> origin/gh/mikaylagawarecki/385/head 
2025-12-04T09:17:11.4165100Z * [new branch] gh/mikaylagawarecki/385/orig -> origin/gh/mikaylagawarecki/385/orig 2025-12-04T09:17:11.4168109Z * [new branch] gh/mikaylagawarecki/386/base -> origin/gh/mikaylagawarecki/386/base 2025-12-04T09:17:11.4170456Z * [new branch] gh/mikaylagawarecki/386/head -> origin/gh/mikaylagawarecki/386/head 2025-12-04T09:17:11.4172523Z * [new branch] gh/mikaylagawarecki/386/orig -> origin/gh/mikaylagawarecki/386/orig 2025-12-04T09:17:11.4175592Z * [new branch] gh/mikaylagawarecki/387/base -> origin/gh/mikaylagawarecki/387/base 2025-12-04T09:17:11.4177516Z * [new branch] gh/mikaylagawarecki/387/head -> origin/gh/mikaylagawarecki/387/head 2025-12-04T09:17:11.4179559Z * [new branch] gh/mikaylagawarecki/387/orig -> origin/gh/mikaylagawarecki/387/orig 2025-12-04T09:17:11.4182232Z * [new branch] gh/mikaylagawarecki/388/base -> origin/gh/mikaylagawarecki/388/base 2025-12-04T09:17:11.4184319Z * [new branch] gh/mikaylagawarecki/388/head -> origin/gh/mikaylagawarecki/388/head 2025-12-04T09:17:11.4186421Z * [new branch] gh/mikaylagawarecki/388/orig -> origin/gh/mikaylagawarecki/388/orig 2025-12-04T09:17:11.4189309Z * [new branch] gh/mikaylagawarecki/389/base -> origin/gh/mikaylagawarecki/389/base 2025-12-04T09:17:11.4191371Z * [new branch] gh/mikaylagawarecki/389/head -> origin/gh/mikaylagawarecki/389/head 2025-12-04T09:17:11.4193414Z * [new branch] gh/mikaylagawarecki/389/orig -> origin/gh/mikaylagawarecki/389/orig 2025-12-04T09:17:11.4196237Z * [new branch] gh/mikaylagawarecki/390/base -> origin/gh/mikaylagawarecki/390/base 2025-12-04T09:17:11.4198372Z * [new branch] gh/mikaylagawarecki/390/head -> origin/gh/mikaylagawarecki/390/head 2025-12-04T09:17:11.4200987Z * [new branch] gh/mikaylagawarecki/390/orig -> origin/gh/mikaylagawarecki/390/orig 2025-12-04T09:17:11.4203895Z * [new branch] gh/mikaylagawarecki/391/base -> origin/gh/mikaylagawarecki/391/base 2025-12-04T09:17:11.4205927Z * [new branch] gh/mikaylagawarecki/391/head -> 
origin/gh/mikaylagawarecki/391/head 2025-12-04T09:17:11.4207976Z * [new branch] gh/mikaylagawarecki/391/orig -> origin/gh/mikaylagawarecki/391/orig 2025-12-04T09:17:11.4210949Z * [new branch] gh/mikaylagawarecki/392/base -> origin/gh/mikaylagawarecki/392/base 2025-12-04T09:17:11.4213441Z * [new branch] gh/mikaylagawarecki/392/head -> origin/gh/mikaylagawarecki/392/head 2025-12-04T09:17:11.4215965Z * [new branch] gh/mikaylagawarecki/392/orig -> origin/gh/mikaylagawarecki/392/orig 2025-12-04T09:17:11.4219178Z * [new branch] gh/mlazos/41/base -> origin/gh/mlazos/41/base 2025-12-04T09:17:11.4221070Z * [new branch] gh/mlazos/41/head -> origin/gh/mlazos/41/head 2025-12-04T09:17:11.4223005Z * [new branch] gh/mlazos/41/orig -> origin/gh/mlazos/41/orig 2025-12-04T09:17:11.4225883Z * [new branch] gh/mlazos/42/base -> origin/gh/mlazos/42/base 2025-12-04T09:17:11.4227861Z * [new branch] gh/mlazos/42/head -> origin/gh/mlazos/42/head 2025-12-04T09:17:11.4229924Z * [new branch] gh/mlazos/42/orig -> origin/gh/mlazos/42/orig 2025-12-04T09:17:11.4232447Z * [new branch] gh/mlazos/43/base -> origin/gh/mlazos/43/base 2025-12-04T09:17:11.4234689Z * [new branch] gh/mlazos/43/head -> origin/gh/mlazos/43/head 2025-12-04T09:17:11.4236722Z * [new branch] gh/mlazos/43/orig -> origin/gh/mlazos/43/orig 2025-12-04T09:17:11.4239366Z * [new branch] gh/mlazos/44/base -> origin/gh/mlazos/44/base 2025-12-04T09:17:11.4241520Z * [new branch] gh/mlazos/44/head -> origin/gh/mlazos/44/head 2025-12-04T09:17:11.4243567Z * [new branch] gh/mlazos/44/orig -> origin/gh/mlazos/44/orig 2025-12-04T09:17:11.4246296Z * [new branch] gh/mlazos/47/base -> origin/gh/mlazos/47/base 2025-12-04T09:17:11.4248360Z * [new branch] gh/mlazos/47/head -> origin/gh/mlazos/47/head 2025-12-04T09:17:11.4250389Z * [new branch] gh/mlazos/47/orig -> origin/gh/mlazos/47/orig 2025-12-04T09:17:11.4253128Z * [new branch] gh/mlazos/48/base -> origin/gh/mlazos/48/base 2025-12-04T09:17:11.4256137Z * [new branch] gh/mlazos/48/head -> 
origin/gh/mlazos/48/head 2025-12-04T09:17:11.4258113Z * [new branch] gh/mlazos/48/orig -> origin/gh/mlazos/48/orig 2025-12-04T09:17:11.4260489Z * [new branch] gh/mlazos/49/base -> origin/gh/mlazos/49/base 2025-12-04T09:17:11.4262518Z * [new branch] gh/mlazos/49/head -> origin/gh/mlazos/49/head 2025-12-04T09:17:11.4264754Z * [new branch] gh/mlazos/49/orig -> origin/gh/mlazos/49/orig 2025-12-04T09:17:11.4267331Z * [new branch] gh/mlazos/50/base -> origin/gh/mlazos/50/base 2025-12-04T09:17:11.4269358Z * [new branch] gh/mlazos/50/head -> origin/gh/mlazos/50/head 2025-12-04T09:17:11.4271424Z * [new branch] gh/mlazos/50/orig -> origin/gh/mlazos/50/orig 2025-12-04T09:17:11.4274010Z * [new branch] gh/mlazos/51/base -> origin/gh/mlazos/51/base 2025-12-04T09:17:11.4276023Z * [new branch] gh/mlazos/51/head -> origin/gh/mlazos/51/head 2025-12-04T09:17:11.4278061Z * [new branch] gh/mlazos/51/orig -> origin/gh/mlazos/51/orig 2025-12-04T09:17:11.4281124Z * [new branch] gh/mlazos/52/base -> origin/gh/mlazos/52/base 2025-12-04T09:17:11.4283184Z * [new branch] gh/mlazos/52/head -> origin/gh/mlazos/52/head 2025-12-04T09:17:11.4285225Z * [new branch] gh/mlazos/52/orig -> origin/gh/mlazos/52/orig 2025-12-04T09:17:11.4288047Z * [new branch] gh/mlazos/53/base -> origin/gh/mlazos/53/base 2025-12-04T09:17:11.4290061Z * [new branch] gh/mlazos/53/head -> origin/gh/mlazos/53/head 2025-12-04T09:17:11.4292071Z * [new branch] gh/mlazos/53/orig -> origin/gh/mlazos/53/orig 2025-12-04T09:17:11.4302379Z * [new branch] gh/mlazos/54/base -> origin/gh/mlazos/54/base 2025-12-04T09:17:11.4302977Z * [new branch] gh/mlazos/54/head -> origin/gh/mlazos/54/head 2025-12-04T09:17:11.4303478Z * [new branch] gh/mlazos/54/orig -> origin/gh/mlazos/54/orig 2025-12-04T09:17:11.4303966Z * [new branch] gh/mlazos/55/base -> origin/gh/mlazos/55/base 2025-12-04T09:17:11.4304622Z * [new branch] gh/mlazos/55/head -> origin/gh/mlazos/55/head 2025-12-04T09:17:11.4306527Z * [new branch] gh/mlazos/55/orig -> 
origin/gh/mlazos/55/orig 2025-12-04T09:17:11.4308988Z * [new branch] gh/mlazos/56/base -> origin/gh/mlazos/56/base 2025-12-04T09:17:11.4310893Z * [new branch] gh/mlazos/56/head -> origin/gh/mlazos/56/head 2025-12-04T09:17:11.4312954Z * [new branch] gh/mlazos/56/orig -> origin/gh/mlazos/56/orig 2025-12-04T09:17:11.4315332Z * [new branch] gh/mlazos/57/base -> origin/gh/mlazos/57/base 2025-12-04T09:17:11.4317271Z * [new branch] gh/mlazos/57/head -> origin/gh/mlazos/57/head 2025-12-04T09:17:11.4318944Z * [new branch] gh/mlazos/57/orig -> origin/gh/mlazos/57/orig 2025-12-04T09:17:11.4321758Z * [new branch] gh/mlazos/58/base -> origin/gh/mlazos/58/base 2025-12-04T09:17:11.4323628Z * [new branch] gh/mlazos/58/head -> origin/gh/mlazos/58/head 2025-12-04T09:17:11.4325491Z * [new branch] gh/mlazos/58/orig -> origin/gh/mlazos/58/orig 2025-12-04T09:17:11.4328478Z * [new branch] gh/mlazos/59/base -> origin/gh/mlazos/59/base 2025-12-04T09:17:11.4329898Z * [new branch] gh/mlazos/59/head -> origin/gh/mlazos/59/head 2025-12-04T09:17:11.4332370Z * [new branch] gh/mlazos/59/orig -> origin/gh/mlazos/59/orig 2025-12-04T09:17:11.4334989Z * [new branch] gh/mlazos/60/base -> origin/gh/mlazos/60/base 2025-12-04T09:17:11.4336900Z * [new branch] gh/mlazos/60/head -> origin/gh/mlazos/60/head 2025-12-04T09:17:11.4338669Z * [new branch] gh/mlazos/60/orig -> origin/gh/mlazos/60/orig 2025-12-04T09:17:11.4341694Z * [new branch] gh/mlazos/61/base -> origin/gh/mlazos/61/base 2025-12-04T09:17:11.4343714Z * [new branch] gh/mlazos/61/head -> origin/gh/mlazos/61/head 2025-12-04T09:17:11.4345462Z * [new branch] gh/mlazos/61/orig -> origin/gh/mlazos/61/orig 2025-12-04T09:17:11.4348064Z * [new branch] gh/mlazos/62/base -> origin/gh/mlazos/62/base 2025-12-04T09:17:11.4349983Z * [new branch] gh/mlazos/62/head -> origin/gh/mlazos/62/head 2025-12-04T09:17:11.4351878Z * [new branch] gh/mlazos/62/orig -> origin/gh/mlazos/62/orig 2025-12-04T09:17:11.4354573Z * [new branch] gh/mlazos/63/base -> 
origin/gh/mlazos/63/base 2025-12-04T09:17:11.4356595Z * [new branch] gh/mlazos/63/head -> origin/gh/mlazos/63/head 2025-12-04T09:17:11.4358532Z * [new branch] gh/mlazos/63/orig -> origin/gh/mlazos/63/orig 2025-12-04T09:17:11.4361682Z * [new branch] gh/mlazos/64/base -> origin/gh/mlazos/64/base 2025-12-04T09:17:11.4363669Z * [new branch] gh/mlazos/64/head -> origin/gh/mlazos/64/head 2025-12-04T09:17:11.4365633Z * [new branch] gh/mlazos/64/orig -> origin/gh/mlazos/64/orig 2025-12-04T09:17:11.4368081Z * [new branch] gh/mlazos/65/base -> origin/gh/mlazos/65/base 2025-12-04T09:17:11.4370353Z * [new branch] gh/mlazos/65/head -> origin/gh/mlazos/65/head 2025-12-04T09:17:11.4371706Z * [new branch] gh/mlazos/65/orig -> origin/gh/mlazos/65/orig 2025-12-04T09:17:11.4374375Z * [new branch] gh/mlazos/66/base -> origin/gh/mlazos/66/base 2025-12-04T09:17:11.4376225Z * [new branch] gh/mlazos/66/head -> origin/gh/mlazos/66/head 2025-12-04T09:17:11.4378094Z * [new branch] gh/mlazos/66/orig -> origin/gh/mlazos/66/orig 2025-12-04T09:17:11.4380770Z * [new branch] gh/mlazos/67/base -> origin/gh/mlazos/67/base 2025-12-04T09:17:11.4382619Z * [new branch] gh/mlazos/67/head -> origin/gh/mlazos/67/head 2025-12-04T09:17:11.4384492Z * [new branch] gh/mlazos/67/orig -> origin/gh/mlazos/67/orig 2025-12-04T09:17:11.4387093Z * [new branch] gh/mlazos/68/base -> origin/gh/mlazos/68/base 2025-12-04T09:17:11.4389021Z * [new branch] gh/mlazos/68/head -> origin/gh/mlazos/68/head 2025-12-04T09:17:11.4390990Z * [new branch] gh/mlazos/68/orig -> origin/gh/mlazos/68/orig 2025-12-04T09:17:11.4393625Z * [new branch] gh/mlazos/69/base -> origin/gh/mlazos/69/base 2025-12-04T09:17:11.4395477Z * [new branch] gh/mlazos/69/head -> origin/gh/mlazos/69/head 2025-12-04T09:17:11.4397338Z * [new branch] gh/mlazos/69/orig -> origin/gh/mlazos/69/orig 2025-12-04T09:17:11.4400377Z * [new branch] gh/mlazos/70/base -> origin/gh/mlazos/70/base 2025-12-04T09:17:11.4402627Z * [new branch] gh/mlazos/70/head -> 
origin/gh/mlazos/70/head 2025-12-04T09:17:11.4404332Z * [new branch] gh/mlazos/70/orig -> origin/gh/mlazos/70/orig 2025-12-04T09:17:11.4407476Z * [new branch] gh/mlazos/71/base -> origin/gh/mlazos/71/base 2025-12-04T09:17:11.4410304Z * [new branch] gh/mlazos/71/head -> origin/gh/mlazos/71/head 2025-12-04T09:17:11.4410800Z * [new branch] gh/mlazos/71/orig -> origin/gh/mlazos/71/orig 2025-12-04T09:17:11.4413114Z * [new branch] gh/mlazos/72/base -> origin/gh/mlazos/72/base 2025-12-04T09:17:11.4415407Z * [new branch] gh/mlazos/72/head -> origin/gh/mlazos/72/head 2025-12-04T09:17:11.4416626Z * [new branch] gh/mlazos/72/orig -> origin/gh/mlazos/72/orig 2025-12-04T09:17:11.4419522Z * [new branch] gh/mlazos/73/base -> origin/gh/mlazos/73/base 2025-12-04T09:17:11.4421485Z * [new branch] gh/mlazos/73/head -> origin/gh/mlazos/73/head 2025-12-04T09:17:11.4423396Z * [new branch] gh/mlazos/73/orig -> origin/gh/mlazos/73/orig 2025-12-04T09:17:11.4426419Z * [new branch] gh/mrmiywj/1/base -> origin/gh/mrmiywj/1/base 2025-12-04T09:17:11.4428364Z * [new branch] gh/mrmiywj/1/head -> origin/gh/mrmiywj/1/head 2025-12-04T09:17:11.4431779Z * [new branch] gh/muchulee8/73/base -> origin/gh/muchulee8/73/base 2025-12-04T09:17:11.4433456Z * [new branch] gh/muchulee8/73/head -> origin/gh/muchulee8/73/head 2025-12-04T09:17:11.4435484Z * [new branch] gh/muchulee8/73/orig -> origin/gh/muchulee8/73/orig 2025-12-04T09:17:11.4438671Z * [new branch] gh/naveenthangudu/1/base -> origin/gh/naveenthangudu/1/base 2025-12-04T09:17:11.4440703Z * [new branch] gh/naveenthangudu/1/head -> origin/gh/naveenthangudu/1/head 2025-12-04T09:17:11.4442563Z * [new branch] gh/naveenthangudu/1/orig -> origin/gh/naveenthangudu/1/orig 2025-12-04T09:17:11.4445211Z * [new branch] gh/naveenthangudu/2/base -> origin/gh/naveenthangudu/2/base 2025-12-04T09:17:11.4446931Z * [new branch] gh/naveenthangudu/2/head -> origin/gh/naveenthangudu/2/head 2025-12-04T09:17:11.4448876Z * [new branch] gh/naveenthangudu/2/orig -> 
origin/gh/naveenthangudu/2/orig 2025-12-04T09:17:11.4451279Z * [new branch] gh/naveenthangudu/3/base -> origin/gh/naveenthangudu/3/base 2025-12-04T09:17:11.4453178Z * [new branch] gh/naveenthangudu/3/head -> origin/gh/naveenthangudu/3/head 2025-12-04T09:17:11.4455073Z * [new branch] gh/naveenthangudu/3/orig -> origin/gh/naveenthangudu/3/orig 2025-12-04T09:17:11.4457531Z * [new branch] gh/naveenthangudu/4/base -> origin/gh/naveenthangudu/4/base 2025-12-04T09:17:11.4459435Z * [new branch] gh/naveenthangudu/4/head -> origin/gh/naveenthangudu/4/head 2025-12-04T09:17:11.4461369Z * [new branch] gh/naveenthangudu/4/orig -> origin/gh/naveenthangudu/4/orig 2025-12-04T09:17:11.4464149Z * [new branch] gh/naveenthangudu/5/base -> origin/gh/naveenthangudu/5/base 2025-12-04T09:17:11.4465985Z * [new branch] gh/naveenthangudu/5/head -> origin/gh/naveenthangudu/5/head 2025-12-04T09:17:11.4467959Z * [new branch] gh/naveenthangudu/5/orig -> origin/gh/naveenthangudu/5/orig 2025-12-04T09:17:11.4470455Z * [new branch] gh/naveenthangudu/6/base -> origin/gh/naveenthangudu/6/base 2025-12-04T09:17:11.4472297Z * [new branch] gh/naveenthangudu/6/head -> origin/gh/naveenthangudu/6/head 2025-12-04T09:17:11.4474044Z * [new branch] gh/naveenthangudu/6/orig -> origin/gh/naveenthangudu/6/orig 2025-12-04T09:17:11.4476570Z * [new branch] gh/naveenthangudu/7/base -> origin/gh/naveenthangudu/7/base 2025-12-04T09:17:11.4478514Z * [new branch] gh/naveenthangudu/7/head -> origin/gh/naveenthangudu/7/head 2025-12-04T09:17:11.4480374Z * [new branch] gh/naveenthangudu/7/orig -> origin/gh/naveenthangudu/7/orig 2025-12-04T09:17:11.4482794Z * [new branch] gh/naveenthangudu/8/base -> origin/gh/naveenthangudu/8/base 2025-12-04T09:17:11.4484814Z * [new branch] gh/naveenthangudu/8/head -> origin/gh/naveenthangudu/8/head 2025-12-04T09:17:11.4486699Z * [new branch] gh/naveenthangudu/8/orig -> origin/gh/naveenthangudu/8/orig 2025-12-04T09:17:11.4489335Z * [new branch] gh/naveenthangudu/9/base -> 
origin/gh/naveenthangudu/9/base 2025-12-04T09:17:11.4491207Z * [new branch] gh/naveenthangudu/9/head -> origin/gh/naveenthangudu/9/head 2025-12-04T09:17:11.4493072Z * [new branch] gh/naveenthangudu/9/orig -> origin/gh/naveenthangudu/9/orig 2025-12-04T09:17:11.4495979Z * [new branch] gh/nikitaved/1/base -> origin/gh/nikitaved/1/base 2025-12-04T09:17:11.4497911Z * [new branch] gh/nikitaved/1/head -> origin/gh/nikitaved/1/head 2025-12-04T09:17:11.4499807Z * [new branch] gh/nikitaved/1/orig -> origin/gh/nikitaved/1/orig 2025-12-04T09:17:11.4502655Z * [new branch] gh/nikitaved/10/base -> origin/gh/nikitaved/10/base 2025-12-04T09:17:11.4504833Z * [new branch] gh/nikitaved/10/head -> origin/gh/nikitaved/10/head 2025-12-04T09:17:11.4506541Z * [new branch] gh/nikitaved/10/orig -> origin/gh/nikitaved/10/orig 2025-12-04T09:17:11.4508892Z * [new branch] gh/nikitaved/11/base -> origin/gh/nikitaved/11/base 2025-12-04T09:17:11.4510863Z * [new branch] gh/nikitaved/11/head -> origin/gh/nikitaved/11/head 2025-12-04T09:17:11.4512672Z * [new branch] gh/nikitaved/11/orig -> origin/gh/nikitaved/11/orig 2025-12-04T09:17:11.4515292Z * [new branch] gh/nikitaved/12/base -> origin/gh/nikitaved/12/base 2025-12-04T09:17:11.4517178Z * [new branch] gh/nikitaved/12/head -> origin/gh/nikitaved/12/head 2025-12-04T09:17:11.4519152Z * [new branch] gh/nikitaved/12/orig -> origin/gh/nikitaved/12/orig 2025-12-04T09:17:11.4521883Z * [new branch] gh/nikitaved/13/base -> origin/gh/nikitaved/13/base 2025-12-04T09:17:11.4524036Z * [new branch] gh/nikitaved/13/head -> origin/gh/nikitaved/13/head 2025-12-04T09:17:11.4525771Z * [new branch] gh/nikitaved/13/orig -> origin/gh/nikitaved/13/orig 2025-12-04T09:17:11.4528187Z * [new branch] gh/nikitaved/14/base -> origin/gh/nikitaved/14/base 2025-12-04T09:17:11.4530194Z * [new branch] gh/nikitaved/14/head -> origin/gh/nikitaved/14/head 2025-12-04T09:17:11.4532057Z * [new branch] gh/nikitaved/14/orig -> origin/gh/nikitaved/14/orig 2025-12-04T09:17:11.4534313Z * [new 
branch] gh/nikitaved/15/base -> origin/gh/nikitaved/15/base 2025-12-04T09:17:11.4536263Z * [new branch] gh/nikitaved/15/head -> origin/gh/nikitaved/15/head 2025-12-04T09:17:11.4538050Z * [new branch] gh/nikitaved/15/orig -> origin/gh/nikitaved/15/orig 2025-12-04T09:17:11.4540736Z * [new branch] gh/nikitaved/16/base -> origin/gh/nikitaved/16/base 2025-12-04T09:17:11.4542716Z * [new branch] gh/nikitaved/16/head -> origin/gh/nikitaved/16/head 2025-12-04T09:17:11.4544621Z * [new branch] gh/nikitaved/16/orig -> origin/gh/nikitaved/16/orig 2025-12-04T09:17:11.4547072Z * [new branch] gh/nikitaved/2/base -> origin/gh/nikitaved/2/base 2025-12-04T09:17:11.4549121Z * [new branch] gh/nikitaved/2/head -> origin/gh/nikitaved/2/head 2025-12-04T09:17:11.4551114Z * [new branch] gh/nikitaved/2/orig -> origin/gh/nikitaved/2/orig 2025-12-04T09:17:11.4553321Z * [new branch] gh/nikitaved/4/base -> origin/gh/nikitaved/4/base 2025-12-04T09:17:11.4555157Z * [new branch] gh/nikitaved/4/head -> origin/gh/nikitaved/4/head 2025-12-04T09:17:11.4557003Z * [new branch] gh/nikitaved/4/orig -> origin/gh/nikitaved/4/orig 2025-12-04T09:17:11.4559598Z * [new branch] gh/nikitaved/5/base -> origin/gh/nikitaved/5/base 2025-12-04T09:17:11.4561637Z * [new branch] gh/nikitaved/5/head -> origin/gh/nikitaved/5/head 2025-12-04T09:17:11.4563718Z * [new branch] gh/nikitaved/5/orig -> origin/gh/nikitaved/5/orig 2025-12-04T09:17:11.4566283Z * [new branch] gh/nikitaved/6/base -> origin/gh/nikitaved/6/base 2025-12-04T09:17:11.4567986Z * [new branch] gh/nikitaved/6/head -> origin/gh/nikitaved/6/head 2025-12-04T09:17:11.4569851Z * [new branch] gh/nikitaved/6/orig -> origin/gh/nikitaved/6/orig 2025-12-04T09:17:11.4572407Z * [new branch] gh/nikitaved/8/base -> origin/gh/nikitaved/8/base 2025-12-04T09:17:11.4574495Z * [new branch] gh/nikitaved/8/head -> origin/gh/nikitaved/8/head 2025-12-04T09:17:11.4576483Z * [new branch] gh/nikitaved/8/orig -> origin/gh/nikitaved/8/orig 2025-12-04T09:17:11.4578807Z * [new branch] 
gh/nikitaved/9/base -> origin/gh/nikitaved/9/base 2025-12-04T09:17:11.4580783Z * [new branch] gh/nikitaved/9/head -> origin/gh/nikitaved/9/head 2025-12-04T09:17:11.4582633Z * [new branch] gh/nikitaved/9/orig -> origin/gh/nikitaved/9/orig 2025-12-04T09:17:11.4585522Z * [new branch] gh/oulgen/10/base -> origin/gh/oulgen/10/base 2025-12-04T09:17:11.4587383Z * [new branch] gh/oulgen/10/head -> origin/gh/oulgen/10/head 2025-12-04T09:17:11.4589385Z * [new branch] gh/oulgen/10/orig -> origin/gh/oulgen/10/orig 2025-12-04T09:17:11.4591808Z * [new branch] gh/oulgen/11/base -> origin/gh/oulgen/11/base 2025-12-04T09:17:11.4593637Z * [new branch] gh/oulgen/11/head -> origin/gh/oulgen/11/head 2025-12-04T09:17:11.4595440Z * [new branch] gh/oulgen/11/orig -> origin/gh/oulgen/11/orig 2025-12-04T09:17:11.4597935Z * [new branch] gh/oulgen/12/base -> origin/gh/oulgen/12/base 2025-12-04T09:17:11.4600006Z * [new branch] gh/oulgen/12/head -> origin/gh/oulgen/12/head 2025-12-04T09:17:11.4602457Z * [new branch] gh/oulgen/12/orig -> origin/gh/oulgen/12/orig 2025-12-04T09:17:11.4604861Z * [new branch] gh/oulgen/13/base -> origin/gh/oulgen/13/base 2025-12-04T09:17:11.4606841Z * [new branch] gh/oulgen/13/head -> origin/gh/oulgen/13/head 2025-12-04T09:17:11.4608788Z * [new branch] gh/oulgen/13/orig -> origin/gh/oulgen/13/orig 2025-12-04T09:17:11.4611169Z * [new branch] gh/oulgen/14/base -> origin/gh/oulgen/14/base 2025-12-04T09:17:11.4613159Z * [new branch] gh/oulgen/14/head -> origin/gh/oulgen/14/head 2025-12-04T09:17:11.4615442Z * [new branch] gh/oulgen/14/orig -> origin/gh/oulgen/14/orig 2025-12-04T09:17:11.4617662Z * [new branch] gh/oulgen/15/base -> origin/gh/oulgen/15/base 2025-12-04T09:17:11.4619553Z * [new branch] gh/oulgen/15/head -> origin/gh/oulgen/15/head 2025-12-04T09:17:11.4621418Z * [new branch] gh/oulgen/15/orig -> origin/gh/oulgen/15/orig 2025-12-04T09:17:11.4624422Z * [new branch] gh/oulgen/16/base -> origin/gh/oulgen/16/base 2025-12-04T09:17:11.4626235Z * [new branch] 
gh/oulgen/16/head -> origin/gh/oulgen/16/head 2025-12-04T09:17:11.4628127Z * [new branch] gh/oulgen/16/orig -> origin/gh/oulgen/16/orig 2025-12-04T09:17:11.4630689Z * [new branch] gh/oulgen/17/base -> origin/gh/oulgen/17/base 2025-12-04T09:17:11.4632543Z * [new branch] gh/oulgen/17/head -> origin/gh/oulgen/17/head 2025-12-04T09:17:11.4634738Z * [new branch] gh/oulgen/17/orig -> origin/gh/oulgen/17/orig 2025-12-04T09:17:11.4636972Z * [new branch] gh/oulgen/18/base -> origin/gh/oulgen/18/base 2025-12-04T09:17:11.4638897Z * [new branch] gh/oulgen/18/head -> origin/gh/oulgen/18/head 2025-12-04T09:17:11.4641294Z * [new branch] gh/oulgen/18/orig -> origin/gh/oulgen/18/orig 2025-12-04T09:17:11.4643599Z * [new branch] gh/oulgen/19/base -> origin/gh/oulgen/19/base 2025-12-04T09:17:11.4645455Z * [new branch] gh/oulgen/19/head -> origin/gh/oulgen/19/head 2025-12-04T09:17:11.4647315Z * [new branch] gh/oulgen/19/orig -> origin/gh/oulgen/19/orig 2025-12-04T09:17:11.4649832Z * [new branch] gh/oulgen/20/base -> origin/gh/oulgen/20/base 2025-12-04T09:17:11.4651762Z * [new branch] gh/oulgen/20/head -> origin/gh/oulgen/20/head 2025-12-04T09:17:11.4653573Z * [new branch] gh/oulgen/20/orig -> origin/gh/oulgen/20/orig 2025-12-04T09:17:11.4655999Z * [new branch] gh/oulgen/21/base -> origin/gh/oulgen/21/base 2025-12-04T09:17:11.4657839Z * [new branch] gh/oulgen/21/head -> origin/gh/oulgen/21/head 2025-12-04T09:17:11.4659712Z * [new branch] gh/oulgen/21/orig -> origin/gh/oulgen/21/orig 2025-12-04T09:17:11.4662215Z * [new branch] gh/oulgen/22/base -> origin/gh/oulgen/22/base 2025-12-04T09:17:11.4664204Z * [new branch] gh/oulgen/22/head -> origin/gh/oulgen/22/head 2025-12-04T09:17:11.4666014Z * [new branch] gh/oulgen/22/orig -> origin/gh/oulgen/22/orig 2025-12-04T09:17:11.4668536Z * [new branch] gh/oulgen/23/base -> origin/gh/oulgen/23/base 2025-12-04T09:17:11.4670446Z * [new branch] gh/oulgen/23/head -> origin/gh/oulgen/23/head 2025-12-04T09:17:11.4672213Z * [new branch] gh/oulgen/23/orig 
-> origin/gh/oulgen/23/orig 2025-12-04T09:17:11.4674628Z * [new branch] gh/oulgen/24/base -> origin/gh/oulgen/24/base 2025-12-04T09:17:11.4676458Z * [new branch] gh/oulgen/24/head -> origin/gh/oulgen/24/head 2025-12-04T09:17:11.4678316Z * [new branch] gh/oulgen/24/orig -> origin/gh/oulgen/24/orig 2025-12-04T09:17:11.4681027Z * [new branch] gh/oulgen/25/base -> origin/gh/oulgen/25/base 2025-12-04T09:17:11.4683445Z * [new branch] gh/oulgen/25/head -> origin/gh/oulgen/25/head 2025-12-04T09:17:11.4685269Z * [new branch] gh/oulgen/25/orig -> origin/gh/oulgen/25/orig 2025-12-04T09:17:11.4687770Z * [new branch] gh/oulgen/26/base -> origin/gh/oulgen/26/base 2025-12-04T09:17:11.4689716Z * [new branch] gh/oulgen/26/head -> origin/gh/oulgen/26/head 2025-12-04T09:17:11.4691980Z * [new branch] gh/oulgen/26/orig -> origin/gh/oulgen/26/orig 2025-12-04T09:17:11.4694756Z * [new branch] gh/oulgen/4/base -> origin/gh/oulgen/4/base 2025-12-04T09:17:11.4696630Z * [new branch] gh/oulgen/4/head -> origin/gh/oulgen/4/head 2025-12-04T09:17:11.4698634Z * [new branch] gh/oulgen/4/orig -> origin/gh/oulgen/4/orig 2025-12-04T09:17:11.4702111Z * [new branch] gh/oulgen/7/base -> origin/gh/oulgen/7/base 2025-12-04T09:17:11.4704036Z * [new branch] gh/oulgen/7/head -> origin/gh/oulgen/7/head 2025-12-04T09:17:11.4706381Z * [new branch] gh/oulgen/7/orig -> origin/gh/oulgen/7/orig 2025-12-04T09:17:11.4708973Z * [new branch] gh/oulgen/8/base -> origin/gh/oulgen/8/base 2025-12-04T09:17:11.4711371Z * [new branch] gh/oulgen/8/head -> origin/gh/oulgen/8/head 2025-12-04T09:17:11.4713237Z * [new branch] gh/oulgen/8/orig -> origin/gh/oulgen/8/orig 2025-12-04T09:17:11.4715746Z * [new branch] gh/oulgen/9/base -> origin/gh/oulgen/9/base 2025-12-04T09:17:11.4717584Z * [new branch] gh/oulgen/9/head -> origin/gh/oulgen/9/head 2025-12-04T09:17:11.4719767Z * [new branch] gh/oulgen/9/orig -> origin/gh/oulgen/9/orig 2025-12-04T09:17:11.4722349Z * [new branch] gh/patvig/mtia-serialization -> 
origin/gh/patvig/mtia-serialization 2025-12-04T09:17:11.4725591Z * [new branch] gh/pearu/108/base -> origin/gh/pearu/108/base 2025-12-04T09:17:11.4727546Z * [new branch] gh/pearu/108/head -> origin/gh/pearu/108/head 2025-12-04T09:17:11.4729554Z * [new branch] gh/pearu/108/orig -> origin/gh/pearu/108/orig 2025-12-04T09:17:11.4732071Z * [new branch] gh/pearu/109/base -> origin/gh/pearu/109/base 2025-12-04T09:17:11.4733922Z * [new branch] gh/pearu/109/head -> origin/gh/pearu/109/head 2025-12-04T09:17:11.4735679Z * [new branch] gh/pearu/109/orig -> origin/gh/pearu/109/orig 2025-12-04T09:17:11.4738209Z * [new branch] gh/pearu/110/base -> origin/gh/pearu/110/base 2025-12-04T09:17:11.4740515Z * [new branch] gh/pearu/110/head -> origin/gh/pearu/110/head 2025-12-04T09:17:11.4742396Z * [new branch] gh/pearu/110/orig -> origin/gh/pearu/110/orig 2025-12-04T09:17:11.4744975Z * [new branch] gh/pearu/111/base -> origin/gh/pearu/111/base 2025-12-04T09:17:11.4747344Z * [new branch] gh/pearu/111/head -> origin/gh/pearu/111/head 2025-12-04T09:17:11.4749479Z * [new branch] gh/pearu/111/orig -> origin/gh/pearu/111/orig 2025-12-04T09:17:11.4752024Z * [new branch] gh/pearu/112/base -> origin/gh/pearu/112/base 2025-12-04T09:17:11.4753824Z * [new branch] gh/pearu/112/head -> origin/gh/pearu/112/head 2025-12-04T09:17:11.4755719Z * [new branch] gh/pearu/112/orig -> origin/gh/pearu/112/orig 2025-12-04T09:17:11.4758142Z * [new branch] gh/pearu/115/base -> origin/gh/pearu/115/base 2025-12-04T09:17:11.4760209Z * [new branch] gh/pearu/115/head -> origin/gh/pearu/115/head 2025-12-04T09:17:11.4762037Z * [new branch] gh/pearu/115/orig -> origin/gh/pearu/115/orig 2025-12-04T09:17:11.4764449Z * [new branch] gh/pearu/116/base -> origin/gh/pearu/116/base 2025-12-04T09:17:11.4766405Z * [new branch] gh/pearu/116/head -> origin/gh/pearu/116/head 2025-12-04T09:17:11.4768246Z * [new branch] gh/pearu/116/orig -> origin/gh/pearu/116/orig 2025-12-04T09:17:11.4770903Z * [new branch] gh/pearu/117/base -> 
origin/gh/pearu/117/base 2025-12-04T09:17:11.4772688Z * [new branch] gh/pearu/117/head -> origin/gh/pearu/117/head 2025-12-04T09:17:11.4774589Z * [new branch] gh/pearu/117/orig -> origin/gh/pearu/117/orig 2025-12-04T09:17:11.4777076Z * [new branch] gh/pearu/118/base -> origin/gh/pearu/118/base 2025-12-04T09:17:11.4779007Z * [new branch] gh/pearu/118/head -> origin/gh/pearu/118/head 2025-12-04T09:17:11.4780841Z * [new branch] gh/pearu/118/orig -> origin/gh/pearu/118/orig 2025-12-04T09:17:11.4783353Z * [new branch] gh/pearu/119/base -> origin/gh/pearu/119/base 2025-12-04T09:17:11.4785190Z * [new branch] gh/pearu/119/head -> origin/gh/pearu/119/head 2025-12-04T09:17:11.4787017Z * [new branch] gh/pearu/119/orig -> origin/gh/pearu/119/orig 2025-12-04T09:17:11.4789581Z * [new branch] gh/pearu/139/base -> origin/gh/pearu/139/base 2025-12-04T09:17:11.4791466Z * [new branch] gh/pearu/139/head -> origin/gh/pearu/139/head 2025-12-04T09:17:11.4793293Z * [new branch] gh/pearu/139/orig -> origin/gh/pearu/139/orig 2025-12-04T09:17:11.4795972Z * [new branch] gh/pearu/140/base -> origin/gh/pearu/140/base 2025-12-04T09:17:11.4797979Z * [new branch] gh/pearu/140/head -> origin/gh/pearu/140/head 2025-12-04T09:17:11.4799866Z * [new branch] gh/pearu/140/orig -> origin/gh/pearu/140/orig 2025-12-04T09:17:11.4802973Z * [new branch] gh/pearu/142/base -> origin/gh/pearu/142/base 2025-12-04T09:17:11.4804845Z * [new branch] gh/pearu/142/head -> origin/gh/pearu/142/head 2025-12-04T09:17:11.4806623Z * [new branch] gh/pearu/142/orig -> origin/gh/pearu/142/orig 2025-12-04T09:17:11.4809104Z * [new branch] gh/pearu/143/base -> origin/gh/pearu/143/base 2025-12-04T09:17:11.4810967Z * [new branch] gh/pearu/143/head -> origin/gh/pearu/143/head 2025-12-04T09:17:11.4812963Z * [new branch] gh/pearu/143/orig -> origin/gh/pearu/143/orig 2025-12-04T09:17:11.4815346Z * [new branch] gh/pearu/147/base -> origin/gh/pearu/147/base 2025-12-04T09:17:11.4817144Z * [new branch] gh/pearu/147/head -> 
origin/gh/pearu/147/head 2025-12-04T09:17:11.4819125Z * [new branch] gh/pearu/147/orig -> origin/gh/pearu/147/orig 2025-12-04T09:17:11.4821626Z * [new branch] gh/pearu/149/base -> origin/gh/pearu/149/base 2025-12-04T09:17:11.4823449Z * [new branch] gh/pearu/149/head -> origin/gh/pearu/149/head 2025-12-04T09:17:11.4825283Z * [new branch] gh/pearu/149/orig -> origin/gh/pearu/149/orig 2025-12-04T09:17:11.4828464Z * [new branch] gh/pearu/150/base -> origin/gh/pearu/150/base 2025-12-04T09:17:11.4830434Z * [new branch] gh/pearu/150/head -> origin/gh/pearu/150/head 2025-12-04T09:17:11.4832158Z * [new branch] gh/pearu/150/orig -> origin/gh/pearu/150/orig 2025-12-04T09:17:11.4834735Z * [new branch] gh/pearu/151/base -> origin/gh/pearu/151/base 2025-12-04T09:17:11.4836569Z * [new branch] gh/pearu/151/head -> origin/gh/pearu/151/head 2025-12-04T09:17:11.4838726Z * [new branch] gh/pearu/151/orig -> origin/gh/pearu/151/orig 2025-12-04T09:17:11.4842111Z * [new branch] gh/pearu/152/base -> origin/gh/pearu/152/base 2025-12-04T09:17:11.4844003Z * [new branch] gh/pearu/152/head -> origin/gh/pearu/152/head 2025-12-04T09:17:11.4845839Z * [new branch] gh/pearu/152/orig -> origin/gh/pearu/152/orig 2025-12-04T09:17:11.4848508Z * [new branch] gh/pearu/153/base -> origin/gh/pearu/153/base 2025-12-04T09:17:11.4850411Z * [new branch] gh/pearu/153/head -> origin/gh/pearu/153/head 2025-12-04T09:17:11.4852149Z * [new branch] gh/pearu/153/orig -> origin/gh/pearu/153/orig 2025-12-04T09:17:11.4854701Z * [new branch] gh/pearu/154/base -> origin/gh/pearu/154/base 2025-12-04T09:17:11.4856532Z * [new branch] gh/pearu/154/head -> origin/gh/pearu/154/head 2025-12-04T09:17:11.4858358Z * [new branch] gh/pearu/154/orig -> origin/gh/pearu/154/orig 2025-12-04T09:17:11.4861052Z * [new branch] gh/pearu/155/base -> origin/gh/pearu/155/base 2025-12-04T09:17:11.4862858Z * [new branch] gh/pearu/155/head -> origin/gh/pearu/155/head 2025-12-04T09:17:11.4864699Z * [new branch] gh/pearu/155/orig -> 
origin/gh/pearu/155/orig 2025-12-04T09:17:11.4867187Z * [new branch] gh/pearu/156/base -> origin/gh/pearu/156/base 2025-12-04T09:17:11.4869063Z * [new branch] gh/pearu/156/head -> origin/gh/pearu/156/head 2025-12-04T09:17:11.4870850Z * [new branch] gh/pearu/156/orig -> origin/gh/pearu/156/orig 2025-12-04T09:17:11.4873980Z * [new branch] gh/pearu/56/base -> origin/gh/pearu/56/base 2025-12-04T09:17:11.4876191Z * [new branch] gh/pearu/56/head -> origin/gh/pearu/56/head 2025-12-04T09:17:11.4877985Z * [new branch] gh/pearu/56/orig -> origin/gh/pearu/56/orig 2025-12-04T09:17:11.4880982Z * [new branch] gh/pearu/97/base -> origin/gh/pearu/97/base 2025-12-04T09:17:11.4882987Z * [new branch] gh/pearu/97/head -> origin/gh/pearu/97/head 2025-12-04T09:17:11.4884726Z * [new branch] gh/pearu/97/orig -> origin/gh/pearu/97/orig 2025-12-04T09:17:11.4888375Z * [new branch] gh/pianpwk/21/base -> origin/gh/pianpwk/21/base 2025-12-04T09:17:11.4890322Z * [new branch] gh/pianpwk/21/head -> origin/gh/pianpwk/21/head 2025-12-04T09:17:11.4892923Z * [new branch] gh/pianpwk/28/base -> origin/gh/pianpwk/28/base 2025-12-04T09:17:11.4894806Z * [new branch] gh/pianpwk/28/head -> origin/gh/pianpwk/28/head 2025-12-04T09:17:11.4896687Z * [new branch] gh/pianpwk/28/orig -> origin/gh/pianpwk/28/orig 2025-12-04T09:17:11.4899294Z * [new branch] gh/pianpwk/29/base -> origin/gh/pianpwk/29/base 2025-12-04T09:17:11.4901453Z * [new branch] gh/pianpwk/29/head -> origin/gh/pianpwk/29/head 2025-12-04T09:17:11.4904836Z * [new branch] gh/pianpwk/29/orig -> origin/gh/pianpwk/29/orig 2025-12-04T09:17:11.4907400Z * [new branch] gh/pianpwk/30/base -> origin/gh/pianpwk/30/base 2025-12-04T09:17:11.4909176Z * [new branch] gh/pianpwk/30/head -> origin/gh/pianpwk/30/head 2025-12-04T09:17:11.4911010Z * [new branch] gh/pianpwk/30/orig -> origin/gh/pianpwk/30/orig 2025-12-04T09:17:11.4913531Z * [new branch] gh/pianpwk/31/base -> origin/gh/pianpwk/31/base 2025-12-04T09:17:11.4915350Z * [new branch] gh/pianpwk/31/head -> 
origin/gh/pianpwk/31/head 2025-12-04T09:17:11.4917196Z * [new branch] gh/pianpwk/31/orig -> origin/gh/pianpwk/31/orig 2025-12-04T09:17:11.4919629Z * [new branch] gh/pianpwk/32/base -> origin/gh/pianpwk/32/base 2025-12-04T09:17:11.4921527Z * [new branch] gh/pianpwk/32/head -> origin/gh/pianpwk/32/head 2025-12-04T09:17:11.4923325Z * [new branch] gh/pianpwk/32/orig -> origin/gh/pianpwk/32/orig 2025-12-04T09:17:11.4925756Z * [new branch] gh/pianpwk/33/base -> origin/gh/pianpwk/33/base 2025-12-04T09:17:11.4927619Z * [new branch] gh/pianpwk/33/head -> origin/gh/pianpwk/33/head 2025-12-04T09:17:11.4929449Z * [new branch] gh/pianpwk/33/orig -> origin/gh/pianpwk/33/orig 2025-12-04T09:17:11.4932216Z * [new branch] gh/pianpwk/34/base -> origin/gh/pianpwk/34/base 2025-12-04T09:17:11.4934324Z * [new branch] gh/pianpwk/34/head -> origin/gh/pianpwk/34/head 2025-12-04T09:17:11.4936326Z * [new branch] gh/pianpwk/34/orig -> origin/gh/pianpwk/34/orig 2025-12-04T09:17:11.4938832Z * [new branch] gh/pianpwk/35/base -> origin/gh/pianpwk/35/base 2025-12-04T09:17:11.4940835Z * [new branch] gh/pianpwk/35/head -> origin/gh/pianpwk/35/head 2025-12-04T09:17:11.4942723Z * [new branch] gh/pianpwk/35/orig -> origin/gh/pianpwk/35/orig 2025-12-04T09:17:11.4945681Z * [new branch] gh/rec/141/base -> origin/gh/rec/141/base 2025-12-04T09:17:11.4947653Z * [new branch] gh/rec/141/head -> origin/gh/rec/141/head 2025-12-04T09:17:11.4950243Z * [new branch] gh/rec/153/base -> origin/gh/rec/153/base 2025-12-04T09:17:11.4951987Z * [new branch] gh/rec/153/head -> origin/gh/rec/153/head 2025-12-04T09:17:11.4953804Z * [new branch] gh/rec/153/orig -> origin/gh/rec/153/orig 2025-12-04T09:17:11.4956511Z * [new branch] gh/rec/154/base -> origin/gh/rec/154/base 2025-12-04T09:17:11.4958097Z * [new branch] gh/rec/154/head -> origin/gh/rec/154/head 2025-12-04T09:17:11.4960055Z * [new branch] gh/rec/154/orig -> origin/gh/rec/154/orig 2025-12-04T09:17:11.4962628Z * [new branch] gh/rec/164/base -> origin/gh/rec/164/base 
2025-12-04T09:17:11.4964458Z * [new branch] gh/rec/164/head -> origin/gh/rec/164/head 2025-12-04T09:17:11.4966191Z * [new branch] gh/rec/164/orig -> origin/gh/rec/164/orig 2025-12-04T09:17:11.4968727Z * [new branch] gh/rec/166/base -> origin/gh/rec/166/base 2025-12-04T09:17:11.4970542Z * [new branch] gh/rec/166/head -> origin/gh/rec/166/head 2025-12-04T09:17:11.4972376Z * [new branch] gh/rec/166/orig -> origin/gh/rec/166/orig 2025-12-04T09:17:11.4974956Z * [new branch] gh/rec/167/base -> origin/gh/rec/167/base 2025-12-04T09:17:11.4976756Z * [new branch] gh/rec/167/head -> origin/gh/rec/167/head 2025-12-04T09:17:11.4978712Z * [new branch] gh/rec/167/orig -> origin/gh/rec/167/orig 2025-12-04T09:17:11.4981747Z * [new branch] gh/rec/168/base -> origin/gh/rec/168/base 2025-12-04T09:17:11.4983568Z * [new branch] gh/rec/168/head -> origin/gh/rec/168/head 2025-12-04T09:17:11.4985404Z * [new branch] gh/rec/168/orig -> origin/gh/rec/168/orig 2025-12-04T09:17:11.4987936Z * [new branch] gh/rec/169/base -> origin/gh/rec/169/base 2025-12-04T09:17:11.4989825Z * [new branch] gh/rec/169/head -> origin/gh/rec/169/head 2025-12-04T09:17:11.4991638Z * [new branch] gh/rec/169/orig -> origin/gh/rec/169/orig 2025-12-04T09:17:11.4994101Z * [new branch] gh/rec/170/base -> origin/gh/rec/170/base 2025-12-04T09:17:11.4995892Z * [new branch] gh/rec/170/head -> origin/gh/rec/170/head 2025-12-04T09:17:11.4997752Z * [new branch] gh/rec/170/orig -> origin/gh/rec/170/orig 2025-12-04T09:17:11.5000849Z * [new branch] gh/rec/171/base -> origin/gh/rec/171/base 2025-12-04T09:17:11.5002684Z * [new branch] gh/rec/171/head -> origin/gh/rec/171/head 2025-12-04T09:17:11.5004439Z * [new branch] gh/rec/171/orig -> origin/gh/rec/171/orig 2025-12-04T09:17:11.5006830Z * [new branch] gh/rec/172/base -> origin/gh/rec/172/base 2025-12-04T09:17:11.5008922Z * [new branch] gh/rec/172/head -> origin/gh/rec/172/head 2025-12-04T09:17:11.5010516Z * [new branch] gh/rec/172/orig -> origin/gh/rec/172/orig 
2025-12-04T09:17:11.5013038Z * [new branch] gh/rec/173/base -> origin/gh/rec/173/base 2025-12-04T09:17:11.5014863Z * [new branch] gh/rec/173/head -> origin/gh/rec/173/head 2025-12-04T09:17:11.5016638Z * [new branch] gh/rec/173/orig -> origin/gh/rec/173/orig 2025-12-04T09:17:11.5019120Z * [new branch] gh/rec/174/base -> origin/gh/rec/174/base 2025-12-04T09:17:11.5020964Z * [new branch] gh/rec/174/head -> origin/gh/rec/174/head 2025-12-04T09:17:11.5022778Z * [new branch] gh/rec/174/orig -> origin/gh/rec/174/orig 2025-12-04T09:17:11.5025332Z * [new branch] gh/rec/175/base -> origin/gh/rec/175/base 2025-12-04T09:17:11.5027212Z * [new branch] gh/rec/175/head -> origin/gh/rec/175/head 2025-12-04T09:17:11.5029031Z * [new branch] gh/rec/175/orig -> origin/gh/rec/175/orig 2025-12-04T09:17:11.5031690Z * [new branch] gh/rec/176/base -> origin/gh/rec/176/base 2025-12-04T09:17:11.5033892Z * [new branch] gh/rec/176/head -> origin/gh/rec/176/head 2025-12-04T09:17:11.5035729Z * [new branch] gh/rec/176/orig -> origin/gh/rec/176/orig 2025-12-04T09:17:11.5038238Z * [new branch] gh/rec/177/base -> origin/gh/rec/177/base 2025-12-04T09:17:11.5040447Z * [new branch] gh/rec/177/head -> origin/gh/rec/177/head 2025-12-04T09:17:11.5042016Z * [new branch] gh/rec/177/orig -> origin/gh/rec/177/orig 2025-12-04T09:17:11.5045074Z * [new branch] gh/robert-hardwick/3/base -> origin/gh/robert-hardwick/3/base 2025-12-04T09:17:11.5046979Z * [new branch] gh/robert-hardwick/3/head -> origin/gh/robert-hardwick/3/head 2025-12-04T09:17:11.5048910Z * [new branch] gh/robert-hardwick/3/orig -> origin/gh/robert-hardwick/3/orig 2025-12-04T09:17:11.5051437Z * [new branch] gh/robert-hardwick/4/base -> origin/gh/robert-hardwick/4/base 2025-12-04T09:17:11.5053241Z * [new branch] gh/robert-hardwick/4/head -> origin/gh/robert-hardwick/4/head 2025-12-04T09:17:11.5055050Z * [new branch] gh/robert-hardwick/4/orig -> origin/gh/robert-hardwick/4/orig 2025-12-04T09:17:11.5057503Z * [new branch] gh/robert-hardwick/5/base -> 
origin/gh/robert-hardwick/5/base 2025-12-04T09:17:11.5059391Z * [new branch] gh/robert-hardwick/5/head -> origin/gh/robert-hardwick/5/head 2025-12-04T09:17:11.5061358Z * [new branch] gh/robert-hardwick/5/orig -> origin/gh/robert-hardwick/5/orig 2025-12-04T09:17:11.5063835Z * [new branch] gh/robert-hardwick/6/base -> origin/gh/robert-hardwick/6/base 2025-12-04T09:17:11.5065639Z * [new branch] gh/robert-hardwick/6/head -> origin/gh/robert-hardwick/6/head 2025-12-04T09:17:11.5067432Z * [new branch] gh/robert-hardwick/6/orig -> origin/gh/robert-hardwick/6/orig 2025-12-04T09:17:11.5070326Z * [new branch] gh/robert-hardwick/7/base -> origin/gh/robert-hardwick/7/base 2025-12-04T09:17:11.5072633Z * [new branch] gh/robert-hardwick/7/head -> origin/gh/robert-hardwick/7/head 2025-12-04T09:17:11.5074520Z * [new branch] gh/robert-hardwick/7/orig -> origin/gh/robert-hardwick/7/orig 2025-12-04T09:17:11.5077029Z * [new branch] gh/robert-hardwick/8/base -> origin/gh/robert-hardwick/8/base 2025-12-04T09:17:11.5078910Z * [new branch] gh/robert-hardwick/8/head -> origin/gh/robert-hardwick/8/head 2025-12-04T09:17:11.5080949Z * [new branch] gh/robert-hardwick/8/orig -> origin/gh/robert-hardwick/8/orig 2025-12-04T09:17:11.5083417Z * [new branch] gh/robert-hardwick/9/base -> origin/gh/robert-hardwick/9/base 2025-12-04T09:17:11.5085245Z * [new branch] gh/robert-hardwick/9/head -> origin/gh/robert-hardwick/9/head 2025-12-04T09:17:11.5087050Z * [new branch] gh/robert-hardwick/9/orig -> origin/gh/robert-hardwick/9/orig 2025-12-04T09:17:11.5090131Z * [new branch] gh/rtimpe/1/base -> origin/gh/rtimpe/1/base 2025-12-04T09:17:11.5091961Z * [new branch] gh/rtimpe/1/head -> origin/gh/rtimpe/1/head 2025-12-04T09:17:11.5094427Z * [new branch] gh/rtimpe/2/base -> origin/gh/rtimpe/2/base 2025-12-04T09:17:11.5096149Z * [new branch] gh/rtimpe/2/head -> origin/gh/rtimpe/2/head 2025-12-04T09:17:11.5098852Z * [new branch] gh/rtimpe/22/base -> origin/gh/rtimpe/22/base 2025-12-04T09:17:11.5100691Z * [new 
branch] gh/rtimpe/22/head -> origin/gh/rtimpe/22/head 2025-12-04T09:17:11.5102608Z * [new branch] gh/rtimpe/22/orig -> origin/gh/rtimpe/22/orig 2025-12-04T09:17:11.5105027Z * [new branch] gh/rtimpe/23/base -> origin/gh/rtimpe/23/base 2025-12-04T09:17:11.5107110Z * [new branch] gh/rtimpe/23/head -> origin/gh/rtimpe/23/head 2025-12-04T09:17:11.5108731Z * [new branch] gh/rtimpe/23/orig -> origin/gh/rtimpe/23/orig 2025-12-04T09:17:11.5111180Z * [new branch] gh/rtimpe/24/base -> origin/gh/rtimpe/24/base 2025-12-04T09:17:11.5113015Z * [new branch] gh/rtimpe/24/head -> origin/gh/rtimpe/24/head 2025-12-04T09:17:11.5114791Z * [new branch] gh/rtimpe/24/orig -> origin/gh/rtimpe/24/orig 2025-12-04T09:17:11.5117241Z * [new branch] gh/rtimpe/25/base -> origin/gh/rtimpe/25/base 2025-12-04T09:17:11.5119058Z * [new branch] gh/rtimpe/25/head -> origin/gh/rtimpe/25/head 2025-12-04T09:17:11.5121103Z * [new branch] gh/rtimpe/25/orig -> origin/gh/rtimpe/25/orig 2025-12-04T09:17:11.5123638Z * [new branch] gh/rtimpe/26/base -> origin/gh/rtimpe/26/base 2025-12-04T09:17:11.5125419Z * [new branch] gh/rtimpe/26/head -> origin/gh/rtimpe/26/head 2025-12-04T09:17:11.5127275Z * [new branch] gh/rtimpe/26/orig -> origin/gh/rtimpe/26/orig 2025-12-04T09:17:11.5129704Z * [new branch] gh/rtimpe/27/base -> origin/gh/rtimpe/27/base 2025-12-04T09:17:11.5131541Z * [new branch] gh/rtimpe/27/head -> origin/gh/rtimpe/27/head 2025-12-04T09:17:11.5133419Z * [new branch] gh/rtimpe/27/orig -> origin/gh/rtimpe/27/orig 2025-12-04T09:17:11.5135886Z * [new branch] gh/rtimpe/28/base -> origin/gh/rtimpe/28/base 2025-12-04T09:17:11.5137772Z * [new branch] gh/rtimpe/28/head -> origin/gh/rtimpe/28/head 2025-12-04T09:17:11.5139643Z * [new branch] gh/rtimpe/28/orig -> origin/gh/rtimpe/28/orig 2025-12-04T09:17:11.5142134Z * [new branch] gh/rtimpe/29/base -> origin/gh/rtimpe/29/base 2025-12-04T09:17:11.5143994Z * [new branch] gh/rtimpe/29/head -> origin/gh/rtimpe/29/head 2025-12-04T09:17:11.5145800Z * [new branch] 
gh/rtimpe/29/orig -> origin/gh/rtimpe/29/orig 2025-12-04T09:17:11.5148327Z * [new branch] gh/rtimpe/3/base -> origin/gh/rtimpe/3/base 2025-12-04T09:17:11.5150076Z * [new branch] gh/rtimpe/3/head -> origin/gh/rtimpe/3/head 2025-12-04T09:17:11.5152568Z * [new branch] gh/rtimpe/30/base -> origin/gh/rtimpe/30/base 2025-12-04T09:17:11.5154381Z * [new branch] gh/rtimpe/30/head -> origin/gh/rtimpe/30/head 2025-12-04T09:17:11.5156200Z * [new branch] gh/rtimpe/30/orig -> origin/gh/rtimpe/30/orig 2025-12-04T09:17:11.5158666Z * [new branch] gh/rtimpe/31/base -> origin/gh/rtimpe/31/base 2025-12-04T09:17:11.5160533Z * [new branch] gh/rtimpe/31/head -> origin/gh/rtimpe/31/head 2025-12-04T09:17:11.5162463Z * [new branch] gh/rtimpe/31/orig -> origin/gh/rtimpe/31/orig 2025-12-04T09:17:11.5164910Z * [new branch] gh/rtimpe/32/base -> origin/gh/rtimpe/32/base 2025-12-04T09:17:11.5166700Z * [new branch] gh/rtimpe/32/head -> origin/gh/rtimpe/32/head 2025-12-04T09:17:11.5168605Z * [new branch] gh/rtimpe/32/orig -> origin/gh/rtimpe/32/orig 2025-12-04T09:17:11.5171197Z * [new branch] gh/rtimpe/33/base -> origin/gh/rtimpe/33/base 2025-12-04T09:17:11.5173068Z * [new branch] gh/rtimpe/33/head -> origin/gh/rtimpe/33/head 2025-12-04T09:17:11.5174866Z * [new branch] gh/rtimpe/33/orig -> origin/gh/rtimpe/33/orig 2025-12-04T09:17:11.5177259Z * [new branch] gh/rtimpe/34/base -> origin/gh/rtimpe/34/base 2025-12-04T09:17:11.5179199Z * [new branch] gh/rtimpe/34/head -> origin/gh/rtimpe/34/head 2025-12-04T09:17:11.5181112Z * [new branch] gh/rtimpe/34/orig -> origin/gh/rtimpe/34/orig 2025-12-04T09:17:11.5183566Z * [new branch] gh/rtimpe/35/base -> origin/gh/rtimpe/35/base 2025-12-04T09:17:11.5185412Z * [new branch] gh/rtimpe/35/head -> origin/gh/rtimpe/35/head 2025-12-04T09:17:11.5187327Z * [new branch] gh/rtimpe/35/orig -> origin/gh/rtimpe/35/orig 2025-12-04T09:17:11.5189756Z * [new branch] gh/rtimpe/4/base -> origin/gh/rtimpe/4/base 2025-12-04T09:17:11.5191694Z * [new branch] gh/rtimpe/4/head -> 
origin/gh/rtimpe/4/head 2025-12-04T09:17:11.5194963Z * [new branch] gh/ruisizhang123/1/base -> origin/gh/ruisizhang123/1/base 2025-12-04T09:17:11.5196838Z * [new branch] gh/ruisizhang123/1/head -> origin/gh/ruisizhang123/1/head 2025-12-04T09:17:11.5198669Z * [new branch] gh/ruisizhang123/1/orig -> origin/gh/ruisizhang123/1/orig 2025-12-04T09:17:11.5201637Z * [new branch] gh/ruisizhang123/4/base -> origin/gh/ruisizhang123/4/base 2025-12-04T09:17:11.5203432Z * [new branch] gh/ruisizhang123/4/head -> origin/gh/ruisizhang123/4/head 2025-12-04T09:17:11.5205239Z * [new branch] gh/ruisizhang123/4/orig -> origin/gh/ruisizhang123/4/orig 2025-12-04T09:17:11.5207684Z * [new branch] gh/ruisizhang123/5/base -> origin/gh/ruisizhang123/5/base 2025-12-04T09:17:11.5209494Z * [new branch] gh/ruisizhang123/5/head -> origin/gh/ruisizhang123/5/head 2025-12-04T09:17:11.5211289Z * [new branch] gh/ruisizhang123/5/orig -> origin/gh/ruisizhang123/5/orig 2025-12-04T09:17:11.5213915Z * [new branch] gh/ruisizhang123/6/base -> origin/gh/ruisizhang123/6/base 2025-12-04T09:17:11.5216155Z * [new branch] gh/ruisizhang123/6/head -> origin/gh/ruisizhang123/6/head 2025-12-04T09:17:11.5218023Z * [new branch] gh/ruisizhang123/6/orig -> origin/gh/ruisizhang123/6/orig 2025-12-04T09:17:11.5220726Z * [new branch] gh/ruisizhang123/7/base -> origin/gh/ruisizhang123/7/base 2025-12-04T09:17:11.5222617Z * [new branch] gh/ruisizhang123/7/head -> origin/gh/ruisizhang123/7/head 2025-12-04T09:17:11.5224552Z * [new branch] gh/ruisizhang123/7/orig -> origin/gh/ruisizhang123/7/orig 2025-12-04T09:17:11.5226918Z * [new branch] gh/ruisizhang123/8/base -> origin/gh/ruisizhang123/8/base 2025-12-04T09:17:11.5228871Z * [new branch] gh/ruisizhang123/8/head -> origin/gh/ruisizhang123/8/head 2025-12-04T09:17:11.5230573Z * [new branch] gh/ruisizhang123/8/orig -> origin/gh/ruisizhang123/8/orig 2025-12-04T09:17:11.5233027Z * [new branch] gh/ruisizhang123/9/base -> origin/gh/ruisizhang123/9/base 2025-12-04T09:17:11.5234858Z * [new 
branch] gh/ruisizhang123/9/head -> origin/gh/ruisizhang123/9/head 2025-12-04T09:17:11.5236680Z * [new branch] gh/ruisizhang123/9/orig -> origin/gh/ruisizhang123/9/orig 2025-12-04T09:17:11.5239954Z * [new branch] gh/seemethere/52/base -> origin/gh/seemethere/52/base 2025-12-04T09:17:11.5241816Z * [new branch] gh/seemethere/52/head -> origin/gh/seemethere/52/head 2025-12-04T09:17:11.5243647Z * [new branch] gh/seemethere/52/orig -> origin/gh/seemethere/52/orig 2025-12-04T09:17:11.5246212Z * [new branch] gh/seemethere/53/base -> origin/gh/seemethere/53/base 2025-12-04T09:17:11.5248079Z * [new branch] gh/seemethere/53/head -> origin/gh/seemethere/53/head 2025-12-04T09:17:11.5250043Z * [new branch] gh/seemethere/53/orig -> origin/gh/seemethere/53/orig 2025-12-04T09:17:11.5252401Z * [new branch] gh/seemethere/54/base -> origin/gh/seemethere/54/base 2025-12-04T09:17:11.5254354Z * [new branch] gh/seemethere/54/head -> origin/gh/seemethere/54/head 2025-12-04T09:17:11.5256381Z * [new branch] gh/seemethere/54/orig -> origin/gh/seemethere/54/orig 2025-12-04T09:17:11.5258589Z * [new branch] gh/seemethere/55/base -> origin/gh/seemethere/55/base 2025-12-04T09:17:11.5260436Z * [new branch] gh/seemethere/55/head -> origin/gh/seemethere/55/head 2025-12-04T09:17:11.5262245Z * [new branch] gh/seemethere/55/orig -> origin/gh/seemethere/55/orig 2025-12-04T09:17:11.5264614Z * [new branch] gh/seemethere/59/base -> origin/gh/seemethere/59/base 2025-12-04T09:17:11.5266613Z * [new branch] gh/seemethere/59/head -> origin/gh/seemethere/59/head 2025-12-04T09:17:11.5268387Z * [new branch] gh/seemethere/59/orig -> origin/gh/seemethere/59/orig 2025-12-04T09:17:11.5277325Z * [new branch] gh/seemethere/62/base -> origin/gh/seemethere/62/base 2025-12-04T09:17:11.5277640Z * [new branch] gh/seemethere/62/head -> origin/gh/seemethere/62/head 2025-12-04T09:17:11.5277876Z * [new branch] gh/seemethere/62/orig -> origin/gh/seemethere/62/orig 2025-12-04T09:17:11.5278096Z * [new branch] gh/seemethere/63/base 
-> origin/gh/seemethere/63/base 2025-12-04T09:17:11.5278669Z * [new branch] gh/seemethere/63/head -> origin/gh/seemethere/63/head 2025-12-04T09:17:11.5281079Z * [new branch] gh/seemethere/63/orig -> origin/gh/seemethere/63/orig 2025-12-04T09:17:11.5283443Z * [new branch] gh/seemethere/71/base -> origin/gh/seemethere/71/base 2025-12-04T09:17:11.5285203Z * [new branch] gh/seemethere/71/head -> origin/gh/seemethere/71/head 2025-12-04T09:17:11.5287055Z * [new branch] gh/seemethere/71/orig -> origin/gh/seemethere/71/orig 2025-12-04T09:17:11.5289555Z * [new branch] gh/seemethere/72/base -> origin/gh/seemethere/72/base 2025-12-04T09:17:11.5291371Z * [new branch] gh/seemethere/72/head -> origin/gh/seemethere/72/head 2025-12-04T09:17:11.5293234Z * [new branch] gh/seemethere/72/orig -> origin/gh/seemethere/72/orig 2025-12-04T09:17:11.5296111Z * [new branch] gh/seemethere/73/base -> origin/gh/seemethere/73/base 2025-12-04T09:17:11.5298034Z * [new branch] gh/seemethere/73/head -> origin/gh/seemethere/73/head 2025-12-04T09:17:11.5301206Z * [new branch] gh/seemethere/73/orig -> origin/gh/seemethere/73/orig 2025-12-04T09:17:11.5304783Z * [new branch] gh/seemethere/74/base -> origin/gh/seemethere/74/base 2025-12-04T09:17:11.5306605Z * [new branch] gh/seemethere/74/head -> origin/gh/seemethere/74/head 2025-12-04T09:17:11.5308361Z * [new branch] gh/seemethere/74/orig -> origin/gh/seemethere/74/orig 2025-12-04T09:17:11.5310971Z * [new branch] gh/seemethere/75/base -> origin/gh/seemethere/75/base 2025-12-04T09:17:11.5312809Z * [new branch] gh/seemethere/75/head -> origin/gh/seemethere/75/head 2025-12-04T09:17:11.5315374Z * [new branch] gh/seemethere/75/orig -> origin/gh/seemethere/75/orig 2025-12-04T09:17:11.5318266Z * [new branch] gh/seemethere/76/base -> origin/gh/seemethere/76/base 2025-12-04T09:17:11.5320561Z * [new branch] gh/seemethere/76/head -> origin/gh/seemethere/76/head 2025-12-04T09:17:11.5322075Z * [new branch] gh/seemethere/76/orig -> origin/gh/seemethere/76/orig 
2025-12-04T09:17:11.5325470Z * [new branch] gh/shunting314/145/base -> origin/gh/shunting314/145/base 2025-12-04T09:17:11.5327450Z * [new branch] gh/shunting314/145/head -> origin/gh/shunting314/145/head 2025-12-04T09:17:11.5329384Z * [new branch] gh/shunting314/145/orig -> origin/gh/shunting314/145/orig 2025-12-04T09:17:11.5332315Z * [new branch] gh/shunting314/176/base -> origin/gh/shunting314/176/base 2025-12-04T09:17:11.5334245Z * [new branch] gh/shunting314/176/head -> origin/gh/shunting314/176/head 2025-12-04T09:17:11.5336054Z * [new branch] gh/shunting314/176/orig -> origin/gh/shunting314/176/orig 2025-12-04T09:17:11.5338780Z * [new branch] gh/shunting314/249/base -> origin/gh/shunting314/249/base 2025-12-04T09:17:11.5340671Z * [new branch] gh/shunting314/249/head -> origin/gh/shunting314/249/head 2025-12-04T09:17:11.5342583Z * [new branch] gh/shunting314/249/orig -> origin/gh/shunting314/249/orig 2025-12-04T09:17:11.5345144Z * [new branch] gh/shunting314/253/base -> origin/gh/shunting314/253/base 2025-12-04T09:17:11.5346915Z * [new branch] gh/shunting314/253/head -> origin/gh/shunting314/253/head 2025-12-04T09:17:11.5348826Z * [new branch] gh/shunting314/253/orig -> origin/gh/shunting314/253/orig 2025-12-04T09:17:11.5351390Z * [new branch] gh/shunting314/256/base -> origin/gh/shunting314/256/base 2025-12-04T09:17:11.5353210Z * [new branch] gh/shunting314/256/head -> origin/gh/shunting314/256/head 2025-12-04T09:17:11.5355065Z * [new branch] gh/shunting314/256/orig -> origin/gh/shunting314/256/orig 2025-12-04T09:17:11.5357899Z * [new branch] gh/shunting314/257/base -> origin/gh/shunting314/257/base 2025-12-04T09:17:11.5359959Z * [new branch] gh/shunting314/257/head -> origin/gh/shunting314/257/head 2025-12-04T09:17:11.5361787Z * [new branch] gh/shunting314/257/orig -> origin/gh/shunting314/257/orig 2025-12-04T09:17:11.5364513Z * [new branch] gh/shunting314/258/base -> origin/gh/shunting314/258/base 2025-12-04T09:17:11.5366278Z * [new branch] 
gh/shunting314/258/head -> origin/gh/shunting314/258/head 2025-12-04T09:17:11.5368183Z * [new branch] gh/shunting314/258/orig -> origin/gh/shunting314/258/orig 2025-12-04T09:17:11.5370552Z * [new branch] gh/shunting314/259/base -> origin/gh/shunting314/259/base 2025-12-04T09:17:11.5372389Z * [new branch] gh/shunting314/259/head -> origin/gh/shunting314/259/head 2025-12-04T09:17:11.5374628Z * [new branch] gh/shunting314/259/orig -> origin/gh/shunting314/259/orig 2025-12-04T09:17:11.5377902Z * [new branch] gh/shunting314/260/base -> origin/gh/shunting314/260/base 2025-12-04T09:17:11.5380165Z * [new branch] gh/shunting314/260/head -> origin/gh/shunting314/260/head 2025-12-04T09:17:11.5382118Z * [new branch] gh/shunting314/260/orig -> origin/gh/shunting314/260/orig 2025-12-04T09:17:11.5384798Z * [new branch] gh/shunting314/261/base -> origin/gh/shunting314/261/base 2025-12-04T09:17:11.5386591Z * [new branch] gh/shunting314/261/head -> origin/gh/shunting314/261/head 2025-12-04T09:17:11.5388525Z * [new branch] gh/shunting314/261/orig -> origin/gh/shunting314/261/orig 2025-12-04T09:17:11.5391078Z * [new branch] gh/shunting314/262/base -> origin/gh/shunting314/262/base 2025-12-04T09:17:11.5392986Z * [new branch] gh/shunting314/262/head -> origin/gh/shunting314/262/head 2025-12-04T09:17:11.5394829Z * [new branch] gh/shunting314/262/orig -> origin/gh/shunting314/262/orig 2025-12-04T09:17:11.5397423Z * [new branch] gh/shunting314/263/base -> origin/gh/shunting314/263/base 2025-12-04T09:17:11.5399852Z * [new branch] gh/shunting314/263/head -> origin/gh/shunting314/263/head 2025-12-04T09:17:11.5402176Z * [new branch] gh/shunting314/263/orig -> origin/gh/shunting314/263/orig 2025-12-04T09:17:11.5404647Z * [new branch] gh/shunting314/264/base -> origin/gh/shunting314/264/base 2025-12-04T09:17:11.5406662Z * [new branch] gh/shunting314/264/head -> origin/gh/shunting314/264/head 2025-12-04T09:17:11.5408422Z * [new branch] gh/shunting314/264/orig -> origin/gh/shunting314/264/orig 
2025-12-04T09:17:11.5410904Z * [new branch] gh/shunting314/265/base -> origin/gh/shunting314/265/base 2025-12-04T09:17:11.5412660Z * [new branch] gh/shunting314/265/head -> origin/gh/shunting314/265/head 2025-12-04T09:17:11.5414440Z * [new branch] gh/shunting314/265/orig -> origin/gh/shunting314/265/orig 2025-12-04T09:17:11.5416924Z * [new branch] gh/shunting314/266/base -> origin/gh/shunting314/266/base 2025-12-04T09:17:11.5419010Z * [new branch] gh/shunting314/266/head -> origin/gh/shunting314/266/head 2025-12-04T09:17:11.5420861Z * [new branch] gh/shunting314/266/orig -> origin/gh/shunting314/266/orig 2025-12-04T09:17:11.5423700Z * [new branch] gh/shunting314/267/base -> origin/gh/shunting314/267/base 2025-12-04T09:17:11.5425642Z * [new branch] gh/shunting314/267/head -> origin/gh/shunting314/267/head 2025-12-04T09:17:11.5427566Z * [new branch] gh/shunting314/267/orig -> origin/gh/shunting314/267/orig 2025-12-04T09:17:11.5430738Z * [new branch] gh/shunting314/268/base -> origin/gh/shunting314/268/base 2025-12-04T09:17:11.5432649Z * [new branch] gh/shunting314/268/head -> origin/gh/shunting314/268/head 2025-12-04T09:17:11.5434578Z * [new branch] gh/shunting314/268/orig -> origin/gh/shunting314/268/orig 2025-12-04T09:17:11.5437209Z * [new branch] gh/shunting314/269/base -> origin/gh/shunting314/269/base 2025-12-04T09:17:11.5439078Z * [new branch] gh/shunting314/269/head -> origin/gh/shunting314/269/head 2025-12-04T09:17:11.5441080Z * [new branch] gh/shunting314/269/orig -> origin/gh/shunting314/269/orig 2025-12-04T09:17:11.5444130Z * [new branch] gh/silverguo/1/base -> origin/gh/silverguo/1/base 2025-12-04T09:17:11.5445956Z * [new branch] gh/silverguo/1/head -> origin/gh/silverguo/1/head 2025-12-04T09:17:11.5448356Z * [new branch] gh/silverguo/2/base -> origin/gh/silverguo/2/base 2025-12-04T09:17:11.5450101Z * [new branch] gh/silverguo/2/head -> origin/gh/silverguo/2/head 2025-12-04T09:17:11.5452598Z * [new branch] gh/silverguo/3/base -> origin/gh/silverguo/3/base 
2025-12-04T09:17:11.5454407Z * [new branch] gh/silverguo/3/head -> origin/gh/silverguo/3/head 2025-12-04T09:17:11.5456743Z * [new branch] gh/silverguo/4/base -> origin/gh/silverguo/4/base 2025-12-04T09:17:11.5459137Z * [new branch] gh/silverguo/4/head -> origin/gh/silverguo/4/head 2025-12-04T09:17:11.5462660Z * [new branch] gh/slayton58/39/base -> origin/gh/slayton58/39/base 2025-12-04T09:17:11.5464462Z * [new branch] gh/slayton58/39/head -> origin/gh/slayton58/39/head 2025-12-04T09:17:11.5466300Z * [new branch] gh/slayton58/39/orig -> origin/gh/slayton58/39/orig 2025-12-04T09:17:11.5468835Z * [new branch] gh/slayton58/42/base -> origin/gh/slayton58/42/base 2025-12-04T09:17:11.5470718Z * [new branch] gh/slayton58/42/head -> origin/gh/slayton58/42/head 2025-12-04T09:17:11.5472589Z * [new branch] gh/slayton58/42/orig -> origin/gh/slayton58/42/orig 2025-12-04T09:17:11.5475146Z * [new branch] gh/slayton58/43/base -> origin/gh/slayton58/43/base 2025-12-04T09:17:11.5477046Z * [new branch] gh/slayton58/43/head -> origin/gh/slayton58/43/head 2025-12-04T09:17:11.5478961Z * [new branch] gh/slayton58/43/orig -> origin/gh/slayton58/43/orig 2025-12-04T09:17:11.5481762Z * [new branch] gh/slayton58/44/base -> origin/gh/slayton58/44/base 2025-12-04T09:17:11.5483709Z * [new branch] gh/slayton58/44/head -> origin/gh/slayton58/44/head 2025-12-04T09:17:11.5485477Z * [new branch] gh/slayton58/44/orig -> origin/gh/slayton58/44/orig 2025-12-04T09:17:11.5487991Z * [new branch] gh/slayton58/45/base -> origin/gh/slayton58/45/base 2025-12-04T09:17:11.5489830Z * [new branch] gh/slayton58/45/head -> origin/gh/slayton58/45/head 2025-12-04T09:17:11.5491654Z * [new branch] gh/slayton58/45/orig -> origin/gh/slayton58/45/orig 2025-12-04T09:17:11.5494184Z * [new branch] gh/slayton58/46/base -> origin/gh/slayton58/46/base 2025-12-04T09:17:11.5496148Z * [new branch] gh/slayton58/46/head -> origin/gh/slayton58/46/head 2025-12-04T09:17:11.5498093Z * [new branch] gh/slayton58/46/orig -> 
origin/gh/slayton58/46/orig 2025-12-04T09:17:11.5500896Z * [new branch] gh/slayton58/6/base -> origin/gh/slayton58/6/base 2025-12-04T09:17:11.5502789Z * [new branch] gh/slayton58/6/head -> origin/gh/slayton58/6/head 2025-12-04T09:17:11.5505085Z * [new branch] gh/slayton58/7/base -> origin/gh/slayton58/7/base 2025-12-04T09:17:11.5506841Z * [new branch] gh/slayton58/7/head -> origin/gh/slayton58/7/head 2025-12-04T09:17:11.5510157Z * [new branch] gh/soulitzer/269/base -> origin/gh/soulitzer/269/base 2025-12-04T09:17:11.5511977Z * [new branch] gh/soulitzer/269/head -> origin/gh/soulitzer/269/head 2025-12-04T09:17:11.5513811Z * [new branch] gh/soulitzer/269/orig -> origin/gh/soulitzer/269/orig 2025-12-04T09:17:11.5516414Z * [new branch] gh/soulitzer/276/base -> origin/gh/soulitzer/276/base 2025-12-04T09:17:11.5518265Z * [new branch] gh/soulitzer/276/head -> origin/gh/soulitzer/276/head 2025-12-04T09:17:11.5520272Z * [new branch] gh/soulitzer/276/orig -> origin/gh/soulitzer/276/orig 2025-12-04T09:17:11.5523177Z * [new branch] gh/soulitzer/287/base -> origin/gh/soulitzer/287/base 2025-12-04T09:17:11.5525100Z * [new branch] gh/soulitzer/287/head -> origin/gh/soulitzer/287/head 2025-12-04T09:17:11.5526998Z * [new branch] gh/soulitzer/287/orig -> origin/gh/soulitzer/287/orig 2025-12-04T09:17:11.5529660Z * [new branch] gh/soulitzer/296/base -> origin/gh/soulitzer/296/base 2025-12-04T09:17:11.5531632Z * [new branch] gh/soulitzer/296/head -> origin/gh/soulitzer/296/head 2025-12-04T09:17:11.5533418Z * [new branch] gh/soulitzer/296/orig -> origin/gh/soulitzer/296/orig 2025-12-04T09:17:11.5536528Z * [new branch] gh/soulitzer/299/base -> origin/gh/soulitzer/299/base 2025-12-04T09:17:11.5538593Z * [new branch] gh/soulitzer/299/head -> origin/gh/soulitzer/299/head 2025-12-04T09:17:11.5540385Z * [new branch] gh/soulitzer/299/orig -> origin/gh/soulitzer/299/orig 2025-12-04T09:17:11.5542964Z * [new branch] gh/soulitzer/300/base -> origin/gh/soulitzer/300/base 
2025-12-04T09:17:11.5544805Z * [new branch] gh/soulitzer/300/head -> origin/gh/soulitzer/300/head 2025-12-04T09:17:11.5546599Z * [new branch] gh/soulitzer/300/orig -> origin/gh/soulitzer/300/orig 2025-12-04T09:17:11.5549288Z * [new branch] gh/soulitzer/301/base -> origin/gh/soulitzer/301/base 2025-12-04T09:17:11.5551283Z * [new branch] gh/soulitzer/301/head -> origin/gh/soulitzer/301/head 2025-12-04T09:17:11.5553182Z * [new branch] gh/soulitzer/301/orig -> origin/gh/soulitzer/301/orig 2025-12-04T09:17:11.5555726Z * [new branch] gh/soulitzer/313/base -> origin/gh/soulitzer/313/base 2025-12-04T09:17:11.5557597Z * [new branch] gh/soulitzer/313/head -> origin/gh/soulitzer/313/head 2025-12-04T09:17:11.5559627Z * [new branch] gh/soulitzer/313/orig -> origin/gh/soulitzer/313/orig 2025-12-04T09:17:11.5562082Z * [new branch] gh/soulitzer/319/base -> origin/gh/soulitzer/319/base 2025-12-04T09:17:11.5563942Z * [new branch] gh/soulitzer/319/head -> origin/gh/soulitzer/319/head 2025-12-04T09:17:11.5565736Z * [new branch] gh/soulitzer/319/orig -> origin/gh/soulitzer/319/orig 2025-12-04T09:17:11.5568481Z * [new branch] gh/soulitzer/320/base -> origin/gh/soulitzer/320/base 2025-12-04T09:17:11.5570184Z * [new branch] gh/soulitzer/320/head -> origin/gh/soulitzer/320/head 2025-12-04T09:17:11.5572033Z * [new branch] gh/soulitzer/320/orig -> origin/gh/soulitzer/320/orig 2025-12-04T09:17:11.5574603Z * [new branch] gh/soulitzer/336/base -> origin/gh/soulitzer/336/base 2025-12-04T09:17:11.5576440Z * [new branch] gh/soulitzer/336/head -> origin/gh/soulitzer/336/head 2025-12-04T09:17:11.5578331Z * [new branch] gh/soulitzer/336/orig -> origin/gh/soulitzer/336/orig 2025-12-04T09:17:11.5580988Z * [new branch] gh/soulitzer/347/base -> origin/gh/soulitzer/347/base 2025-12-04T09:17:11.5582740Z * [new branch] gh/soulitzer/347/head -> origin/gh/soulitzer/347/head 2025-12-04T09:17:11.5584560Z * [new branch] gh/soulitzer/347/orig -> origin/gh/soulitzer/347/orig 2025-12-04T09:17:11.5587371Z * [new 
branch] gh/soulitzer/349/base -> origin/gh/soulitzer/349/base 2025-12-04T09:17:11.5589141Z * [new branch] gh/soulitzer/349/head -> origin/gh/soulitzer/349/head 2025-12-04T09:17:11.5591035Z * [new branch] gh/soulitzer/349/orig -> origin/gh/soulitzer/349/orig 2025-12-04T09:17:11.5593448Z * [new branch] gh/soulitzer/350/base -> origin/gh/soulitzer/350/base 2025-12-04T09:17:11.5595257Z * [new branch] gh/soulitzer/350/head -> origin/gh/soulitzer/350/head 2025-12-04T09:17:11.5597099Z * [new branch] gh/soulitzer/350/orig -> origin/gh/soulitzer/350/orig 2025-12-04T09:17:11.5599735Z * [new branch] gh/soulitzer/351/base -> origin/gh/soulitzer/351/base 2025-12-04T09:17:11.5602103Z * [new branch] gh/soulitzer/351/head -> origin/gh/soulitzer/351/head 2025-12-04T09:17:11.5603822Z * [new branch] gh/soulitzer/351/orig -> origin/gh/soulitzer/351/orig 2025-12-04T09:17:11.5606295Z * [new branch] gh/soulitzer/353/base -> origin/gh/soulitzer/353/base 2025-12-04T09:17:11.5608341Z * [new branch] gh/soulitzer/353/head -> origin/gh/soulitzer/353/head 2025-12-04T09:17:11.5610126Z * [new branch] gh/soulitzer/353/orig -> origin/gh/soulitzer/353/orig 2025-12-04T09:17:11.5613391Z * [new branch] gh/soulitzer/358/base -> origin/gh/soulitzer/358/base 2025-12-04T09:17:11.5615395Z * [new branch] gh/soulitzer/358/head -> origin/gh/soulitzer/358/head 2025-12-04T09:17:11.5617202Z * [new branch] gh/soulitzer/358/orig -> origin/gh/soulitzer/358/orig 2025-12-04T09:17:11.5620451Z * [new branch] gh/soulitzer/359/base -> origin/gh/soulitzer/359/base 2025-12-04T09:17:11.5622266Z * [new branch] gh/soulitzer/359/head -> origin/gh/soulitzer/359/head 2025-12-04T09:17:11.5624718Z * [new branch] gh/soulitzer/359/orig -> origin/gh/soulitzer/359/orig 2025-12-04T09:17:11.5627325Z * [new branch] gh/soulitzer/374/base -> origin/gh/soulitzer/374/base 2025-12-04T09:17:11.5629262Z * [new branch] gh/soulitzer/374/head -> origin/gh/soulitzer/374/head 2025-12-04T09:17:11.5631106Z * [new branch] gh/soulitzer/374/orig -> 
origin/gh/soulitzer/374/orig 2025-12-04T09:17:11.5633614Z * [new branch] gh/soulitzer/375/base -> origin/gh/soulitzer/375/base 2025-12-04T09:17:11.5635550Z * [new branch] gh/soulitzer/375/head -> origin/gh/soulitzer/375/head 2025-12-04T09:17:11.5637356Z * [new branch] gh/soulitzer/375/orig -> origin/gh/soulitzer/375/orig 2025-12-04T09:17:11.5639831Z * [new branch] gh/soulitzer/380/base -> origin/gh/soulitzer/380/base 2025-12-04T09:17:11.5641909Z * [new branch] gh/soulitzer/380/head -> origin/gh/soulitzer/380/head 2025-12-04T09:17:11.5644042Z * [new branch] gh/soulitzer/380/orig -> origin/gh/soulitzer/380/orig 2025-12-04T09:17:11.5646634Z * [new branch] gh/soulitzer/385/base -> origin/gh/soulitzer/385/base 2025-12-04T09:17:11.5648736Z * [new branch] gh/soulitzer/385/head -> origin/gh/soulitzer/385/head 2025-12-04T09:17:11.5650586Z * [new branch] gh/soulitzer/385/orig -> origin/gh/soulitzer/385/orig 2025-12-04T09:17:11.5653151Z * [new branch] gh/soulitzer/386/base -> origin/gh/soulitzer/386/base 2025-12-04T09:17:11.5655088Z * [new branch] gh/soulitzer/386/head -> origin/gh/soulitzer/386/head 2025-12-04T09:17:11.5656919Z * [new branch] gh/soulitzer/386/orig -> origin/gh/soulitzer/386/orig 2025-12-04T09:17:11.5659464Z * [new branch] gh/soulitzer/387/base -> origin/gh/soulitzer/387/base 2025-12-04T09:17:11.5661254Z * [new branch] gh/soulitzer/387/head -> origin/gh/soulitzer/387/head 2025-12-04T09:17:11.5663062Z * [new branch] gh/soulitzer/387/orig -> origin/gh/soulitzer/387/orig 2025-12-04T09:17:11.5665734Z * [new branch] gh/soulitzer/388/base -> origin/gh/soulitzer/388/base 2025-12-04T09:17:11.5667504Z * [new branch] gh/soulitzer/388/head -> origin/gh/soulitzer/388/head 2025-12-04T09:17:11.5669364Z * [new branch] gh/soulitzer/388/orig -> origin/gh/soulitzer/388/orig 2025-12-04T09:17:11.5671948Z * [new branch] gh/soulitzer/389/base -> origin/gh/soulitzer/389/base 2025-12-04T09:17:11.5673891Z * [new branch] gh/soulitzer/389/head -> origin/gh/soulitzer/389/head 
2025-12-04T09:17:11.5676141Z * [new branch] gh/soulitzer/389/orig -> origin/gh/soulitzer/389/orig 2025-12-04T09:17:11.5678772Z * [new branch] gh/soulitzer/390/base -> origin/gh/soulitzer/390/base 2025-12-04T09:17:11.5680847Z * [new branch] gh/soulitzer/390/head -> origin/gh/soulitzer/390/head 2025-12-04T09:17:11.5682697Z * [new branch] gh/soulitzer/390/orig -> origin/gh/soulitzer/390/orig 2025-12-04T09:17:11.5685250Z * [new branch] gh/soulitzer/391/base -> origin/gh/soulitzer/391/base 2025-12-04T09:17:11.5687055Z * [new branch] gh/soulitzer/391/head -> origin/gh/soulitzer/391/head 2025-12-04T09:17:11.5688882Z * [new branch] gh/soulitzer/391/orig -> origin/gh/soulitzer/391/orig 2025-12-04T09:17:11.5691478Z * [new branch] gh/soulitzer/392/base -> origin/gh/soulitzer/392/base 2025-12-04T09:17:11.5693258Z * [new branch] gh/soulitzer/392/head -> origin/gh/soulitzer/392/head 2025-12-04T09:17:11.5695123Z * [new branch] gh/soulitzer/392/orig -> origin/gh/soulitzer/392/orig 2025-12-04T09:17:11.5698238Z * [new branch] gh/swolchok/728/next -> origin/gh/swolchok/728/next 2025-12-04T09:17:11.5702556Z * [new branch] gh/swolchok/819/base -> origin/gh/swolchok/819/base 2025-12-04T09:17:11.5704267Z * [new branch] gh/swolchok/819/head -> origin/gh/swolchok/819/head 2025-12-04T09:17:11.5706162Z * [new branch] gh/swolchok/819/orig -> origin/gh/swolchok/819/orig 2025-12-04T09:17:11.5708733Z * [new branch] gh/swolchok/824/base -> origin/gh/swolchok/824/base 2025-12-04T09:17:11.5710706Z * [new branch] gh/swolchok/824/head -> origin/gh/swolchok/824/head 2025-12-04T09:17:11.5712406Z * [new branch] gh/swolchok/824/orig -> origin/gh/swolchok/824/orig 2025-12-04T09:17:11.5714983Z * [new branch] gh/swolchok/829/base -> origin/gh/swolchok/829/base 2025-12-04T09:17:11.5717223Z * [new branch] gh/swolchok/829/head -> origin/gh/swolchok/829/head 2025-12-04T09:17:11.5719081Z * [new branch] gh/swolchok/829/orig -> origin/gh/swolchok/829/orig 2025-12-04T09:17:11.5722336Z * [new branch] 
gh/swolchok/839/base -> origin/gh/swolchok/839/base 2025-12-04T09:17:11.5724093Z * [new branch] gh/swolchok/839/head -> origin/gh/swolchok/839/head 2025-12-04T09:17:11.5725901Z * [new branch] gh/swolchok/839/orig -> origin/gh/swolchok/839/orig 2025-12-04T09:17:11.5728722Z * [new branch] gh/swolchok/841/base -> origin/gh/swolchok/841/base 2025-12-04T09:17:11.5730383Z * [new branch] gh/swolchok/841/head -> origin/gh/swolchok/841/head 2025-12-04T09:17:11.5732371Z * [new branch] gh/swolchok/841/orig -> origin/gh/swolchok/841/orig 2025-12-04T09:17:11.5735098Z * [new branch] gh/swolchok/842/base -> origin/gh/swolchok/842/base 2025-12-04T09:17:11.5737138Z * [new branch] gh/swolchok/842/head -> origin/gh/swolchok/842/head 2025-12-04T09:17:11.5738864Z * [new branch] gh/swolchok/842/orig -> origin/gh/swolchok/842/orig 2025-12-04T09:17:11.5741319Z * [new branch] gh/swolchok/845/base -> origin/gh/swolchok/845/base 2025-12-04T09:17:11.5743170Z * [new branch] gh/swolchok/845/head -> origin/gh/swolchok/845/head 2025-12-04T09:17:11.5745060Z * [new branch] gh/swolchok/845/orig -> origin/gh/swolchok/845/orig 2025-12-04T09:17:11.5747613Z * [new branch] gh/swolchok/848/base -> origin/gh/swolchok/848/base 2025-12-04T09:17:11.5749526Z * [new branch] gh/swolchok/848/head -> origin/gh/swolchok/848/head 2025-12-04T09:17:11.5751372Z * [new branch] gh/swolchok/848/orig -> origin/gh/swolchok/848/orig 2025-12-04T09:17:11.5753841Z * [new branch] gh/swolchok/856/base -> origin/gh/swolchok/856/base 2025-12-04T09:17:11.5755672Z * [new branch] gh/swolchok/856/head -> origin/gh/swolchok/856/head 2025-12-04T09:17:11.5757632Z * [new branch] gh/swolchok/856/orig -> origin/gh/swolchok/856/orig 2025-12-04T09:17:11.5760450Z * [new branch] gh/swolchok/860/base -> origin/gh/swolchok/860/base 2025-12-04T09:17:11.5762343Z * [new branch] gh/swolchok/860/head -> origin/gh/swolchok/860/head 2025-12-04T09:17:11.5764032Z * [new branch] gh/swolchok/860/orig -> origin/gh/swolchok/860/orig 
2025-12-04T09:17:11.5767317Z * [new branch] gh/swolchok/861/base -> origin/gh/swolchok/861/base 2025-12-04T09:17:11.5769404Z * [new branch] gh/swolchok/861/head -> origin/gh/swolchok/861/head 2025-12-04T09:17:11.5771114Z * [new branch] gh/swolchok/861/orig -> origin/gh/swolchok/861/orig 2025-12-04T09:17:11.5774167Z * [new branch] gh/swolchok/862/base -> origin/gh/swolchok/862/base 2025-12-04T09:17:11.5775981Z * [new branch] gh/swolchok/862/head -> origin/gh/swolchok/862/head 2025-12-04T09:17:11.5777744Z * [new branch] gh/swolchok/862/orig -> origin/gh/swolchok/862/orig 2025-12-04T09:17:11.5780425Z * [new branch] gh/swolchok/863/base -> origin/gh/swolchok/863/base 2025-12-04T09:17:11.5782266Z * [new branch] gh/swolchok/863/head -> origin/gh/swolchok/863/head 2025-12-04T09:17:11.5784676Z * [new branch] gh/swolchok/863/orig -> origin/gh/swolchok/863/orig 2025-12-04T09:17:11.5787018Z * [new branch] gh/swolchok/864/base -> origin/gh/swolchok/864/base 2025-12-04T09:17:11.5788850Z * [new branch] gh/swolchok/864/head -> origin/gh/swolchok/864/head 2025-12-04T09:17:11.5790670Z * [new branch] gh/swolchok/864/orig -> origin/gh/swolchok/864/orig 2025-12-04T09:17:11.5793182Z * [new branch] gh/swolchok/865/base -> origin/gh/swolchok/865/base 2025-12-04T09:17:11.5795221Z * [new branch] gh/swolchok/865/head -> origin/gh/swolchok/865/head 2025-12-04T09:17:11.5797116Z * [new branch] gh/swolchok/865/orig -> origin/gh/swolchok/865/orig 2025-12-04T09:17:11.5800741Z * [new branch] gh/swolchok/866/base -> origin/gh/swolchok/866/base 2025-12-04T09:17:11.5802631Z * [new branch] gh/swolchok/866/head -> origin/gh/swolchok/866/head 2025-12-04T09:17:11.5804464Z * [new branch] gh/swolchok/866/orig -> origin/gh/swolchok/866/orig 2025-12-04T09:17:11.5806907Z * [new branch] gh/swolchok/867/base -> origin/gh/swolchok/867/base 2025-12-04T09:17:11.5808853Z * [new branch] gh/swolchok/867/head -> origin/gh/swolchok/867/head 2025-12-04T09:17:11.5810716Z * [new branch] gh/swolchok/867/orig -> 
origin/gh/swolchok/867/orig 2025-12-04T09:17:11.5813357Z * [new branch] gh/swolchok/868/base -> origin/gh/swolchok/868/base 2025-12-04T09:17:11.5815129Z * [new branch] gh/swolchok/868/head -> origin/gh/swolchok/868/head 2025-12-04T09:17:11.5816986Z * [new branch] gh/swolchok/868/orig -> origin/gh/swolchok/868/orig 2025-12-04T09:17:11.5819657Z * [new branch] gh/swolchok/869/base -> origin/gh/swolchok/869/base 2025-12-04T09:17:11.5821529Z * [new branch] gh/swolchok/869/head -> origin/gh/swolchok/869/head 2025-12-04T09:17:11.5823382Z * [new branch] gh/swolchok/869/orig -> origin/gh/swolchok/869/orig 2025-12-04T09:17:11.5826056Z * [new branch] gh/swolchok/870/base -> origin/gh/swolchok/870/base 2025-12-04T09:17:11.5827983Z * [new branch] gh/swolchok/870/head -> origin/gh/swolchok/870/head 2025-12-04T09:17:11.5829785Z * [new branch] gh/swolchok/870/orig -> origin/gh/swolchok/870/orig 2025-12-04T09:17:11.5832254Z * [new branch] gh/swolchok/871/base -> origin/gh/swolchok/871/base 2025-12-04T09:17:11.5834595Z * [new branch] gh/swolchok/871/head -> origin/gh/swolchok/871/head 2025-12-04T09:17:11.5836626Z * [new branch] gh/swolchok/871/orig -> origin/gh/swolchok/871/orig 2025-12-04T09:17:11.5839853Z * [new branch] gh/teja-rao/4/base -> origin/gh/teja-rao/4/base 2025-12-04T09:17:11.5841798Z * [new branch] gh/teja-rao/4/head -> origin/gh/teja-rao/4/head 2025-12-04T09:17:11.5843777Z * [new branch] gh/teja-rao/4/orig -> origin/gh/teja-rao/4/orig 2025-12-04T09:17:11.5846841Z * [new branch] gh/tianyu-l/2/base -> origin/gh/tianyu-l/2/base 2025-12-04T09:17:11.5848831Z * [new branch] gh/tianyu-l/2/head -> origin/gh/tianyu-l/2/head 2025-12-04T09:17:11.5850539Z * [new branch] gh/tianyu-l/2/orig -> origin/gh/tianyu-l/2/orig 2025-12-04T09:17:11.5853032Z * [new branch] gh/tianyu-l/3/base -> origin/gh/tianyu-l/3/base 2025-12-04T09:17:11.5855339Z * [new branch] gh/tianyu-l/3/orig -> origin/gh/tianyu-l/3/orig 2025-12-04T09:17:11.5857819Z * [new branch] gh/tianyu-l/4/base -> 
origin/gh/tianyu-l/4/base 2025-12-04T09:17:11.5859833Z * [new branch] gh/tianyu-l/4/head -> origin/gh/tianyu-l/4/head 2025-12-04T09:17:11.5861638Z * [new branch] gh/tianyu-l/4/orig -> origin/gh/tianyu-l/4/orig 2025-12-04T09:17:11.5865283Z * [new branch] gh/tugsbayasgalan/10/base -> origin/gh/tugsbayasgalan/10/base 2025-12-04T09:17:11.5867064Z * [new branch] gh/tugsbayasgalan/10/head -> origin/gh/tugsbayasgalan/10/head 2025-12-04T09:17:11.5868918Z * [new branch] gh/tugsbayasgalan/10/orig -> origin/gh/tugsbayasgalan/10/orig 2025-12-04T09:17:11.5871481Z * [new branch] gh/tugsbayasgalan/13/base -> origin/gh/tugsbayasgalan/13/base 2025-12-04T09:17:11.5873297Z * [new branch] gh/tugsbayasgalan/13/head -> origin/gh/tugsbayasgalan/13/head 2025-12-04T09:17:11.5875151Z * [new branch] gh/tugsbayasgalan/13/orig -> origin/gh/tugsbayasgalan/13/orig 2025-12-04T09:17:11.5877773Z * [new branch] gh/tugsbayasgalan/17/base -> origin/gh/tugsbayasgalan/17/base 2025-12-04T09:17:11.5879600Z * [new branch] gh/tugsbayasgalan/17/head -> origin/gh/tugsbayasgalan/17/head 2025-12-04T09:17:11.5881553Z * [new branch] gh/tugsbayasgalan/17/orig -> origin/gh/tugsbayasgalan/17/orig 2025-12-04T09:17:11.5884216Z * [new branch] gh/tugsbayasgalan/2/base -> origin/gh/tugsbayasgalan/2/base 2025-12-04T09:17:11.5885988Z * [new branch] gh/tugsbayasgalan/2/head -> origin/gh/tugsbayasgalan/2/head 2025-12-04T09:17:11.5887823Z * [new branch] gh/tugsbayasgalan/2/orig -> origin/gh/tugsbayasgalan/2/orig 2025-12-04T09:17:11.5890721Z * [new branch] gh/tugsbayasgalan/28/base -> origin/gh/tugsbayasgalan/28/base 2025-12-04T09:17:11.5892515Z * [new branch] gh/tugsbayasgalan/28/head -> origin/gh/tugsbayasgalan/28/head 2025-12-04T09:17:11.5894328Z * [new branch] gh/tugsbayasgalan/28/orig -> origin/gh/tugsbayasgalan/28/orig 2025-12-04T09:17:11.5897361Z * [new branch] gh/tugsbayasgalan/32/base -> origin/gh/tugsbayasgalan/32/base 2025-12-04T09:17:11.5899270Z * [new branch] gh/tugsbayasgalan/32/head -> 
origin/gh/tugsbayasgalan/32/head 2025-12-04T09:17:11.5901300Z * [new branch] gh/tugsbayasgalan/32/orig -> origin/gh/tugsbayasgalan/32/orig 2025-12-04T09:17:11.5904040Z * [new branch] gh/tugsbayasgalan/35/base -> origin/gh/tugsbayasgalan/35/base 2025-12-04T09:17:11.5905890Z * [new branch] gh/tugsbayasgalan/35/head -> origin/gh/tugsbayasgalan/35/head 2025-12-04T09:17:11.5907753Z * [new branch] gh/tugsbayasgalan/35/orig -> origin/gh/tugsbayasgalan/35/orig 2025-12-04T09:17:11.5910345Z * [new branch] gh/tugsbayasgalan/36/base -> origin/gh/tugsbayasgalan/36/base 2025-12-04T09:17:11.5912195Z * [new branch] gh/tugsbayasgalan/36/head -> origin/gh/tugsbayasgalan/36/head 2025-12-04T09:17:11.5914030Z * [new branch] gh/tugsbayasgalan/36/orig -> origin/gh/tugsbayasgalan/36/orig 2025-12-04T09:17:11.5916566Z * [new branch] gh/tugsbayasgalan/37/base -> origin/gh/tugsbayasgalan/37/base 2025-12-04T09:17:11.5918454Z * [new branch] gh/tugsbayasgalan/37/head -> origin/gh/tugsbayasgalan/37/head 2025-12-04T09:17:11.5920499Z * [new branch] gh/tugsbayasgalan/37/orig -> origin/gh/tugsbayasgalan/37/orig 2025-12-04T09:17:11.5923016Z * [new branch] gh/tugsbayasgalan/43/base -> origin/gh/tugsbayasgalan/43/base 2025-12-04T09:17:11.5924788Z * [new branch] gh/tugsbayasgalan/43/head -> origin/gh/tugsbayasgalan/43/head 2025-12-04T09:17:11.5926543Z * [new branch] gh/tugsbayasgalan/43/orig -> origin/gh/tugsbayasgalan/43/orig 2025-12-04T09:17:11.5928910Z * [new branch] gh/tugsbayasgalan/48/base -> origin/gh/tugsbayasgalan/48/base 2025-12-04T09:17:11.5930873Z * [new branch] gh/tugsbayasgalan/48/head -> origin/gh/tugsbayasgalan/48/head 2025-12-04T09:17:11.5932695Z * [new branch] gh/tugsbayasgalan/48/orig -> origin/gh/tugsbayasgalan/48/orig 2025-12-04T09:17:11.5935320Z * [new branch] gh/tugsbayasgalan/51/base -> origin/gh/tugsbayasgalan/51/base 2025-12-04T09:17:11.5937206Z * [new branch] gh/tugsbayasgalan/51/head -> origin/gh/tugsbayasgalan/51/head 2025-12-04T09:17:11.5939178Z * [new branch] 
gh/tugsbayasgalan/51/orig -> origin/gh/tugsbayasgalan/51/orig 2025-12-04T09:17:11.5941256Z * [new branch] gh/tugsbayasgalan/52/base -> origin/gh/tugsbayasgalan/52/base 2025-12-04T09:17:11.5943257Z * [new branch] gh/tugsbayasgalan/52/head -> origin/gh/tugsbayasgalan/52/head 2025-12-04T09:17:11.5944929Z * [new branch] gh/tugsbayasgalan/52/orig -> origin/gh/tugsbayasgalan/52/orig 2025-12-04T09:17:11.5948000Z * [new branch] gh/tugsbayasgalan/53/base -> origin/gh/tugsbayasgalan/53/base 2025-12-04T09:17:11.5949856Z * [new branch] gh/tugsbayasgalan/53/head -> origin/gh/tugsbayasgalan/53/head 2025-12-04T09:17:11.5951660Z * [new branch] gh/tugsbayasgalan/53/orig -> origin/gh/tugsbayasgalan/53/orig 2025-12-04T09:17:11.5954310Z * [new branch] gh/tugsbayasgalan/55/base -> origin/gh/tugsbayasgalan/55/base 2025-12-04T09:17:11.5956247Z * [new branch] gh/tugsbayasgalan/55/head -> origin/gh/tugsbayasgalan/55/head 2025-12-04T09:17:11.5958149Z * [new branch] gh/tugsbayasgalan/55/orig -> origin/gh/tugsbayasgalan/55/orig 2025-12-04T09:17:11.5961130Z * [new branch] gh/tugsbayasgalan/59/base -> origin/gh/tugsbayasgalan/59/base 2025-12-04T09:17:11.5963069Z * [new branch] gh/tugsbayasgalan/59/head -> origin/gh/tugsbayasgalan/59/head 2025-12-04T09:17:11.5965117Z * [new branch] gh/tugsbayasgalan/59/orig -> origin/gh/tugsbayasgalan/59/orig 2025-12-04T09:17:11.5967354Z * [new branch] gh/tugsbayasgalan/6/base -> origin/gh/tugsbayasgalan/6/base 2025-12-04T09:17:11.5969050Z * [new branch] gh/tugsbayasgalan/6/head -> origin/gh/tugsbayasgalan/6/head 2025-12-04T09:17:11.5970862Z * [new branch] gh/tugsbayasgalan/6/orig -> origin/gh/tugsbayasgalan/6/orig 2025-12-04T09:17:11.5973261Z * [new branch] gh/tugsbayasgalan/60/base -> origin/gh/tugsbayasgalan/60/base 2025-12-04T09:17:11.5974984Z * [new branch] gh/tugsbayasgalan/60/head -> origin/gh/tugsbayasgalan/60/head 2025-12-04T09:17:11.5976834Z * [new branch] gh/tugsbayasgalan/60/orig -> origin/gh/tugsbayasgalan/60/orig 2025-12-04T09:17:11.5980203Z * [new 
branch] gh/tugsbayasgalan/61/base -> origin/gh/tugsbayasgalan/61/base 2025-12-04T09:17:11.5981972Z * [new branch] gh/tugsbayasgalan/61/head -> origin/gh/tugsbayasgalan/61/head 2025-12-04T09:17:11.5983812Z * [new branch] gh/tugsbayasgalan/61/orig -> origin/gh/tugsbayasgalan/61/orig 2025-12-04T09:17:11.5986639Z * [new branch] gh/tugsbayasgalan/63/base -> origin/gh/tugsbayasgalan/63/base 2025-12-04T09:17:11.5988627Z * [new branch] gh/tugsbayasgalan/63/head -> origin/gh/tugsbayasgalan/63/head 2025-12-04T09:17:11.5990594Z * [new branch] gh/tugsbayasgalan/63/orig -> origin/gh/tugsbayasgalan/63/orig 2025-12-04T09:17:11.5993015Z * [new branch] gh/tugsbayasgalan/67/base -> origin/gh/tugsbayasgalan/67/base 2025-12-04T09:17:11.5994786Z * [new branch] gh/tugsbayasgalan/67/head -> origin/gh/tugsbayasgalan/67/head 2025-12-04T09:17:11.5996624Z * [new branch] gh/tugsbayasgalan/67/orig -> origin/gh/tugsbayasgalan/67/orig 2025-12-04T09:17:11.5999340Z * [new branch] gh/tugsbayasgalan/68/base -> origin/gh/tugsbayasgalan/68/base 2025-12-04T09:17:11.6001729Z * [new branch] gh/tugsbayasgalan/68/head -> origin/gh/tugsbayasgalan/68/head 2025-12-04T09:17:11.6003356Z * [new branch] gh/tugsbayasgalan/68/orig -> origin/gh/tugsbayasgalan/68/orig 2025-12-04T09:17:11.6005988Z * [new branch] gh/tugsbayasgalan/7/base -> origin/gh/tugsbayasgalan/7/base 2025-12-04T09:17:11.6007872Z * [new branch] gh/tugsbayasgalan/7/head -> origin/gh/tugsbayasgalan/7/head 2025-12-04T09:17:11.6009833Z * [new branch] gh/tugsbayasgalan/7/orig -> origin/gh/tugsbayasgalan/7/orig 2025-12-04T09:17:11.6012638Z * [new branch] gh/tugsbayasgalan/70/base -> origin/gh/tugsbayasgalan/70/base 2025-12-04T09:17:11.6014668Z * [new branch] gh/tugsbayasgalan/70/head -> origin/gh/tugsbayasgalan/70/head 2025-12-04T09:17:11.6016617Z * [new branch] gh/tugsbayasgalan/70/orig -> origin/gh/tugsbayasgalan/70/orig 2025-12-04T09:17:11.6019354Z * [new branch] gh/tugsbayasgalan/71/base -> origin/gh/tugsbayasgalan/71/base 
2025-12-04T09:17:11.6021283Z * [new branch] gh/tugsbayasgalan/71/head -> origin/gh/tugsbayasgalan/71/head 2025-12-04T09:17:11.6023640Z * [new branch] gh/tugsbayasgalan/71/orig -> origin/gh/tugsbayasgalan/71/orig 2025-12-04T09:17:11.6026441Z * [new branch] gh/tugsbayasgalan/72/base -> origin/gh/tugsbayasgalan/72/base 2025-12-04T09:17:11.6028348Z * [new branch] gh/tugsbayasgalan/72/head -> origin/gh/tugsbayasgalan/72/head 2025-12-04T09:17:11.6030606Z * [new branch] gh/tugsbayasgalan/72/orig -> origin/gh/tugsbayasgalan/72/orig 2025-12-04T09:17:11.6033251Z * [new branch] gh/tugsbayasgalan/73/base -> origin/gh/tugsbayasgalan/73/base 2025-12-04T09:17:11.6035155Z * [new branch] gh/tugsbayasgalan/73/head -> origin/gh/tugsbayasgalan/73/head 2025-12-04T09:17:11.6037043Z * [new branch] gh/tugsbayasgalan/73/orig -> origin/gh/tugsbayasgalan/73/orig 2025-12-04T09:17:11.6040173Z * [new branch] gh/tugsbayasgalan/74/base -> origin/gh/tugsbayasgalan/74/base 2025-12-04T09:17:11.6042137Z * [new branch] gh/tugsbayasgalan/74/head -> origin/gh/tugsbayasgalan/74/head 2025-12-04T09:17:11.6043852Z * [new branch] gh/tugsbayasgalan/74/orig -> origin/gh/tugsbayasgalan/74/orig 2025-12-04T09:17:11.6046520Z * [new branch] gh/tugsbayasgalan/75/base -> origin/gh/tugsbayasgalan/75/base 2025-12-04T09:17:11.6048205Z * [new branch] gh/tugsbayasgalan/75/head -> origin/gh/tugsbayasgalan/75/head 2025-12-04T09:17:11.6050179Z * [new branch] gh/tugsbayasgalan/75/orig -> origin/gh/tugsbayasgalan/75/orig 2025-12-04T09:17:11.6052438Z * [new branch] gh/tugsbayasgalan/76/base -> origin/gh/tugsbayasgalan/76/base 2025-12-04T09:17:11.6054387Z * [new branch] gh/tugsbayasgalan/76/head -> origin/gh/tugsbayasgalan/76/head 2025-12-04T09:17:11.6056209Z * [new branch] gh/tugsbayasgalan/76/orig -> origin/gh/tugsbayasgalan/76/orig 2025-12-04T09:17:11.6058966Z * [new branch] gh/tugsbayasgalan/77/base -> origin/gh/tugsbayasgalan/77/base 2025-12-04T09:17:11.6060865Z * [new branch] gh/tugsbayasgalan/77/head -> 
origin/gh/tugsbayasgalan/77/head 2025-12-04T09:17:11.6062560Z * [new branch] gh/tugsbayasgalan/77/orig -> origin/gh/tugsbayasgalan/77/orig 2025-12-04T09:17:11.6065290Z * [new branch] gh/tugsbayasgalan/78/base -> origin/gh/tugsbayasgalan/78/base 2025-12-04T09:17:11.6067705Z * [new branch] gh/tugsbayasgalan/78/head -> origin/gh/tugsbayasgalan/78/head 2025-12-04T09:17:11.6069583Z * [new branch] gh/tugsbayasgalan/78/orig -> origin/gh/tugsbayasgalan/78/orig 2025-12-04T09:17:11.6072020Z * [new branch] gh/tugsbayasgalan/79/base -> origin/gh/tugsbayasgalan/79/base 2025-12-04T09:17:11.6073877Z * [new branch] gh/tugsbayasgalan/79/head -> origin/gh/tugsbayasgalan/79/head 2025-12-04T09:17:11.6075710Z * [new branch] gh/tugsbayasgalan/79/orig -> origin/gh/tugsbayasgalan/79/orig 2025-12-04T09:17:11.6078358Z * [new branch] gh/tugsbayasgalan/8/base -> origin/gh/tugsbayasgalan/8/base 2025-12-04T09:17:11.6080287Z * [new branch] gh/tugsbayasgalan/8/head -> origin/gh/tugsbayasgalan/8/head 2025-12-04T09:17:11.6082255Z * [new branch] gh/tugsbayasgalan/8/orig -> origin/gh/tugsbayasgalan/8/orig 2025-12-04T09:17:11.6084669Z * [new branch] gh/tugsbayasgalan/80/base -> origin/gh/tugsbayasgalan/80/base 2025-12-04T09:17:11.6086408Z * [new branch] gh/tugsbayasgalan/80/head -> origin/gh/tugsbayasgalan/80/head 2025-12-04T09:17:11.6088127Z * [new branch] gh/tugsbayasgalan/80/orig -> origin/gh/tugsbayasgalan/80/orig 2025-12-04T09:17:11.6091294Z * [new branch] gh/tugsbayasgalan/81/base -> origin/gh/tugsbayasgalan/81/base 2025-12-04T09:17:11.6093345Z * [new branch] gh/tugsbayasgalan/81/head -> origin/gh/tugsbayasgalan/81/head 2025-12-04T09:17:11.6094868Z * [new branch] gh/tugsbayasgalan/81/orig -> origin/gh/tugsbayasgalan/81/orig 2025-12-04T09:17:11.6098142Z * [new branch] gh/tugsbayasgalan/82/base -> origin/gh/tugsbayasgalan/82/base 2025-12-04T09:17:11.6100139Z * [new branch] gh/tugsbayasgalan/82/head -> origin/gh/tugsbayasgalan/82/head 2025-12-04T09:17:11.6104829Z * [new branch] 
gh/tugsbayasgalan/82/orig -> origin/gh/tugsbayasgalan/82/orig 2025-12-04T09:17:11.6107077Z * [new branch] gh/tugsbayasgalan/83/base -> origin/gh/tugsbayasgalan/83/base 2025-12-04T09:17:11.6108924Z * [new branch] gh/tugsbayasgalan/83/head -> origin/gh/tugsbayasgalan/83/head 2025-12-04T09:17:11.6110836Z * [new branch] gh/tugsbayasgalan/83/orig -> origin/gh/tugsbayasgalan/83/orig 2025-12-04T09:17:11.6113151Z * [new branch] gh/tugsbayasgalan/84/base -> origin/gh/tugsbayasgalan/84/base 2025-12-04T09:17:11.6115054Z * [new branch] gh/tugsbayasgalan/84/head -> origin/gh/tugsbayasgalan/84/head 2025-12-04T09:17:11.6116900Z * [new branch] gh/tugsbayasgalan/84/orig -> origin/gh/tugsbayasgalan/84/orig 2025-12-04T09:17:11.6119511Z * [new branch] gh/tugsbayasgalan/85/base -> origin/gh/tugsbayasgalan/85/base 2025-12-04T09:17:11.6121486Z * [new branch] gh/tugsbayasgalan/85/head -> origin/gh/tugsbayasgalan/85/head 2025-12-04T09:17:11.6123282Z * [new branch] gh/tugsbayasgalan/85/orig -> origin/gh/tugsbayasgalan/85/orig 2025-12-04T09:17:11.6125901Z * [new branch] gh/tugsbayasgalan/86/base -> origin/gh/tugsbayasgalan/86/base 2025-12-04T09:17:11.6127823Z * [new branch] gh/tugsbayasgalan/86/head -> origin/gh/tugsbayasgalan/86/head 2025-12-04T09:17:11.6129590Z * [new branch] gh/tugsbayasgalan/86/orig -> origin/gh/tugsbayasgalan/86/orig 2025-12-04T09:17:11.6132557Z * [new branch] gh/tugsbayasgalan/87/base -> origin/gh/tugsbayasgalan/87/base 2025-12-04T09:17:11.6134437Z * [new branch] gh/tugsbayasgalan/87/head -> origin/gh/tugsbayasgalan/87/head 2025-12-04T09:17:11.6136255Z * [new branch] gh/tugsbayasgalan/87/orig -> origin/gh/tugsbayasgalan/87/orig 2025-12-04T09:17:11.6138960Z * [new branch] gh/tugsbayasgalan/88/base -> origin/gh/tugsbayasgalan/88/base 2025-12-04T09:17:11.6140735Z * [new branch] gh/tugsbayasgalan/88/head -> origin/gh/tugsbayasgalan/88/head 2025-12-04T09:17:11.6142601Z * [new branch] gh/tugsbayasgalan/88/orig -> origin/gh/tugsbayasgalan/88/orig 2025-12-04T09:17:11.6145362Z 
* [new branch] gh/tugsbayasgalan/89/base -> origin/gh/tugsbayasgalan/89/base 2025-12-04T09:17:11.6147159Z * [new branch] gh/tugsbayasgalan/89/head -> origin/gh/tugsbayasgalan/89/head 2025-12-04T09:17:11.6148977Z * [new branch] gh/tugsbayasgalan/89/orig -> origin/gh/tugsbayasgalan/89/orig 2025-12-04T09:17:11.6151587Z * [new branch] gh/tugsbayasgalan/9/base -> origin/gh/tugsbayasgalan/9/base 2025-12-04T09:17:11.6153347Z * [new branch] gh/tugsbayasgalan/9/head -> origin/gh/tugsbayasgalan/9/head 2025-12-04T09:17:11.6155180Z * [new branch] gh/tugsbayasgalan/9/orig -> origin/gh/tugsbayasgalan/9/orig 2025-12-04T09:17:11.6158123Z * [new branch] gh/tugsbayasgalan/90/base -> origin/gh/tugsbayasgalan/90/base 2025-12-04T09:17:11.6159879Z * [new branch] gh/tugsbayasgalan/90/head -> origin/gh/tugsbayasgalan/90/head 2025-12-04T09:17:11.6161808Z * [new branch] gh/tugsbayasgalan/90/orig -> origin/gh/tugsbayasgalan/90/orig 2025-12-04T09:17:11.6164588Z * [new branch] gh/tugsbayasgalan/91/base -> origin/gh/tugsbayasgalan/91/base 2025-12-04T09:17:11.6166220Z * [new branch] gh/tugsbayasgalan/91/head -> origin/gh/tugsbayasgalan/91/head 2025-12-04T09:17:11.6168047Z * [new branch] gh/tugsbayasgalan/91/orig -> origin/gh/tugsbayasgalan/91/orig 2025-12-04T09:17:11.6170865Z * [new branch] gh/tugsbayasgalan/92/base -> origin/gh/tugsbayasgalan/92/base 2025-12-04T09:17:11.6173277Z * [new branch] gh/tugsbayasgalan/92/head -> origin/gh/tugsbayasgalan/92/head 2025-12-04T09:17:11.6175103Z * [new branch] gh/tugsbayasgalan/92/orig -> origin/gh/tugsbayasgalan/92/orig 2025-12-04T09:17:11.6177897Z * [new branch] gh/tugsbayasgalan/93/base -> origin/gh/tugsbayasgalan/93/base 2025-12-04T09:17:11.6179734Z * [new branch] gh/tugsbayasgalan/93/head -> origin/gh/tugsbayasgalan/93/head 2025-12-04T09:17:11.6181552Z * [new branch] gh/tugsbayasgalan/93/orig -> origin/gh/tugsbayasgalan/93/orig 2025-12-04T09:17:11.6184696Z * [new branch] gh/v0i0/14/base -> origin/gh/v0i0/14/base 2025-12-04T09:17:11.6186468Z * [new 
branch] gh/v0i0/14/head -> origin/gh/v0i0/14/head 2025-12-04T09:17:11.6188296Z * [new branch] gh/v0i0/14/orig -> origin/gh/v0i0/14/orig 2025-12-04T09:17:11.6190686Z * [new branch] gh/v0i0/15/base -> origin/gh/v0i0/15/base 2025-12-04T09:17:11.6192653Z * [new branch] gh/v0i0/15/head -> origin/gh/v0i0/15/head 2025-12-04T09:17:11.6194456Z * [new branch] gh/v0i0/15/orig -> origin/gh/v0i0/15/orig 2025-12-04T09:17:11.6197122Z * [new branch] gh/v0i0/16/base -> origin/gh/v0i0/16/base 2025-12-04T09:17:11.6199561Z * [new branch] gh/v0i0/16/head -> origin/gh/v0i0/16/head 2025-12-04T09:17:11.6201872Z * [new branch] gh/v0i0/16/orig -> origin/gh/v0i0/16/orig 2025-12-04T09:17:11.6204189Z * [new branch] gh/v0i0/17/base -> origin/gh/v0i0/17/base 2025-12-04T09:17:11.6206112Z * [new branch] gh/v0i0/17/head -> origin/gh/v0i0/17/head 2025-12-04T09:17:11.6207906Z * [new branch] gh/v0i0/17/orig -> origin/gh/v0i0/17/orig 2025-12-04T09:17:11.6210433Z * [new branch] gh/v0i0/18/base -> origin/gh/v0i0/18/base 2025-12-04T09:17:11.6212317Z * [new branch] gh/v0i0/18/head -> origin/gh/v0i0/18/head 2025-12-04T09:17:11.6214217Z * [new branch] gh/v0i0/18/orig -> origin/gh/v0i0/18/orig 2025-12-04T09:17:11.6216744Z * [new branch] gh/v0i0/19/base -> origin/gh/v0i0/19/base 2025-12-04T09:17:11.6218736Z * [new branch] gh/v0i0/19/head -> origin/gh/v0i0/19/head 2025-12-04T09:17:11.6220463Z * [new branch] gh/v0i0/19/orig -> origin/gh/v0i0/19/orig 2025-12-04T09:17:11.6223719Z * [new branch] gh/vishal9-team/1/base -> origin/gh/vishal9-team/1/base 2025-12-04T09:17:11.6225586Z * [new branch] gh/vishal9-team/1/head -> origin/gh/vishal9-team/1/head 2025-12-04T09:17:11.6227929Z * [new branch] gh/vishal9-team/2/base -> origin/gh/vishal9-team/2/base 2025-12-04T09:17:11.6229808Z * [new branch] gh/vishal9-team/2/head -> origin/gh/vishal9-team/2/head 2025-12-04T09:17:11.6231620Z * [new branch] gh/vishal9-team/2/orig -> origin/gh/vishal9-team/2/orig 2025-12-04T09:17:11.6234346Z * [new branch] gh/vishal9-team/3/base -> 
origin/gh/vishal9-team/3/base 2025-12-04T09:17:11.6236670Z * [new branch] gh/vishal9-team/3/head -> origin/gh/vishal9-team/3/head 2025-12-04T09:17:11.6238435Z * [new branch] gh/vishal9-team/3/orig -> origin/gh/vishal9-team/3/orig 2025-12-04T09:17:11.6241026Z * [new branch] gh/vishal9-team/4/base -> origin/gh/vishal9-team/4/base 2025-12-04T09:17:11.6242843Z * [new branch] gh/vishal9-team/4/head -> origin/gh/vishal9-team/4/head 2025-12-04T09:17:11.6244894Z * [new branch] gh/vishal9-team/4/orig -> origin/gh/vishal9-team/4/orig 2025-12-04T09:17:11.6248270Z * [new branch] gh/vkuzo/1/next -> origin/gh/vkuzo/1/next 2025-12-04T09:17:11.6250779Z * [new branch] gh/vkuzo/2/next -> origin/gh/vkuzo/2/next 2025-12-04T09:17:11.6253435Z * [new branch] gh/vkuzo/3/next -> origin/gh/vkuzo/3/next 2025-12-04T09:17:11.6256268Z * [new branch] gh/wconstab/424/base -> origin/gh/wconstab/424/base 2025-12-04T09:17:11.6258235Z * [new branch] gh/wconstab/424/head -> origin/gh/wconstab/424/head 2025-12-04T09:17:11.6260086Z * [new branch] gh/wconstab/424/orig -> origin/gh/wconstab/424/orig 2025-12-04T09:17:11.6262597Z * [new branch] gh/wconstab/435/base -> origin/gh/wconstab/435/base 2025-12-04T09:17:11.6264605Z * [new branch] gh/wconstab/435/head -> origin/gh/wconstab/435/head 2025-12-04T09:17:11.6266469Z * [new branch] gh/wconstab/435/orig -> origin/gh/wconstab/435/orig 2025-12-04T09:17:11.6268890Z * [new branch] gh/wconstab/444/base -> origin/gh/wconstab/444/base 2025-12-04T09:17:11.6270885Z * [new branch] gh/wconstab/444/head -> origin/gh/wconstab/444/head 2025-12-04T09:17:11.6272655Z * [new branch] gh/wconstab/444/orig -> origin/gh/wconstab/444/orig 2025-12-04T09:17:11.6275267Z * [new branch] gh/wconstab/447/base -> origin/gh/wconstab/447/base 2025-12-04T09:17:11.6277065Z * [new branch] gh/wconstab/447/head -> origin/gh/wconstab/447/head 2025-12-04T09:17:11.6279401Z * [new branch] gh/wconstab/447/orig -> origin/gh/wconstab/447/orig 2025-12-04T09:17:11.6282112Z * [new branch] 
gh/wconstab/448/base -> origin/gh/wconstab/448/base 2025-12-04T09:17:11.6284007Z * [new branch] gh/wconstab/448/head -> origin/gh/wconstab/448/head 2025-12-04T09:17:11.6285843Z * [new branch] gh/wconstab/448/orig -> origin/gh/wconstab/448/orig 2025-12-04T09:17:11.6288267Z * [new branch] gh/wconstab/449/base -> origin/gh/wconstab/449/base 2025-12-04T09:17:11.6290103Z * [new branch] gh/wconstab/449/head -> origin/gh/wconstab/449/head 2025-12-04T09:17:11.6292021Z * [new branch] gh/wconstab/449/orig -> origin/gh/wconstab/449/orig 2025-12-04T09:17:11.6294400Z * [new branch] gh/wconstab/450/base -> origin/gh/wconstab/450/base 2025-12-04T09:17:11.6296291Z * [new branch] gh/wconstab/450/head -> origin/gh/wconstab/450/head 2025-12-04T09:17:11.6298172Z * [new branch] gh/wconstab/450/orig -> origin/gh/wconstab/450/orig 2025-12-04T09:17:11.6300749Z * [new branch] gh/wconstab/451/base -> origin/gh/wconstab/451/base 2025-12-04T09:17:11.6302685Z * [new branch] gh/wconstab/451/head -> origin/gh/wconstab/451/head 2025-12-04T09:17:11.6304468Z * [new branch] gh/wconstab/451/orig -> origin/gh/wconstab/451/orig 2025-12-04T09:17:11.6307057Z * [new branch] gh/wconstab/452/base -> origin/gh/wconstab/452/base 2025-12-04T09:17:11.6308887Z * [new branch] gh/wconstab/452/head -> origin/gh/wconstab/452/head 2025-12-04T09:17:11.6310801Z * [new branch] gh/wconstab/452/orig -> origin/gh/wconstab/452/orig 2025-12-04T09:17:11.6313036Z * [new branch] gh/wconstab/453/base -> origin/gh/wconstab/453/base 2025-12-04T09:17:11.6314931Z * [new branch] gh/wconstab/453/head -> origin/gh/wconstab/453/head 2025-12-04T09:17:11.6316849Z * [new branch] gh/wconstab/453/orig -> origin/gh/wconstab/453/orig 2025-12-04T09:17:11.6319516Z * [new branch] gh/wconstab/454/base -> origin/gh/wconstab/454/base 2025-12-04T09:17:11.6321534Z * [new branch] gh/wconstab/454/head -> origin/gh/wconstab/454/head 2025-12-04T09:17:11.6323314Z * [new branch] gh/wconstab/454/orig -> origin/gh/wconstab/454/orig 
2025-12-04T09:17:11.6325839Z * [new branch] gh/wconstab/455/base -> origin/gh/wconstab/455/base 2025-12-04T09:17:11.6327680Z * [new branch] gh/wconstab/455/head -> origin/gh/wconstab/455/head 2025-12-04T09:17:11.6329535Z * [new branch] gh/wconstab/455/orig -> origin/gh/wconstab/455/orig 2025-12-04T09:17:11.6332320Z * [new branch] gh/wconstab/456/base -> origin/gh/wconstab/456/base 2025-12-04T09:17:11.6334416Z * [new branch] gh/wconstab/456/head -> origin/gh/wconstab/456/head 2025-12-04T09:17:11.6336313Z * [new branch] gh/wconstab/456/orig -> origin/gh/wconstab/456/orig 2025-12-04T09:17:11.6338890Z * [new branch] gh/wconstab/457/base -> origin/gh/wconstab/457/base 2025-12-04T09:17:11.6340747Z * [new branch] gh/wconstab/457/head -> origin/gh/wconstab/457/head 2025-12-04T09:17:11.6342586Z * [new branch] gh/wconstab/457/orig -> origin/gh/wconstab/457/orig 2025-12-04T09:17:11.6345209Z * [new branch] gh/wconstab/458/base -> origin/gh/wconstab/458/base 2025-12-04T09:17:11.6347116Z * [new branch] gh/wconstab/458/head -> origin/gh/wconstab/458/head 2025-12-04T09:17:11.6349028Z * [new branch] gh/wconstab/458/orig -> origin/gh/wconstab/458/orig 2025-12-04T09:17:11.6351473Z * [new branch] gh/wconstab/459/base -> origin/gh/wconstab/459/base 2025-12-04T09:17:11.6353347Z * [new branch] gh/wconstab/459/head -> origin/gh/wconstab/459/head 2025-12-04T09:17:11.6355076Z * [new branch] gh/wconstab/459/orig -> origin/gh/wconstab/459/orig 2025-12-04T09:17:11.6358276Z * [new branch] gh/wconstab/460/base -> origin/gh/wconstab/460/base 2025-12-04T09:17:11.6360589Z * [new branch] gh/wconstab/460/head -> origin/gh/wconstab/460/head 2025-12-04T09:17:11.6362456Z * [new branch] gh/wconstab/460/orig -> origin/gh/wconstab/460/orig 2025-12-04T09:17:11.6365129Z * [new branch] gh/wconstab/461/base -> origin/gh/wconstab/461/base 2025-12-04T09:17:11.6367226Z * [new branch] gh/wconstab/461/head -> origin/gh/wconstab/461/head 2025-12-04T09:17:11.6368777Z * [new branch] gh/wconstab/461/orig -> 
origin/gh/wconstab/461/orig 2025-12-04T09:17:11.6371301Z * [new branch] gh/wconstab/462/base -> origin/gh/wconstab/462/base 2025-12-04T09:17:11.6373414Z * [new branch] gh/wconstab/462/head -> origin/gh/wconstab/462/head 2025-12-04T09:17:11.6375278Z * [new branch] gh/wconstab/462/orig -> origin/gh/wconstab/462/orig 2025-12-04T09:17:11.6377984Z * [new branch] gh/wconstab/463/base -> origin/gh/wconstab/463/base 2025-12-04T09:17:11.6379839Z * [new branch] gh/wconstab/463/head -> origin/gh/wconstab/463/head 2025-12-04T09:17:11.6381747Z * [new branch] gh/wconstab/463/orig -> origin/gh/wconstab/463/orig 2025-12-04T09:17:11.6384403Z * [new branch] gh/wconstab/464/base -> origin/gh/wconstab/464/base 2025-12-04T09:17:11.6386645Z * [new branch] gh/wconstab/464/head -> origin/gh/wconstab/464/head 2025-12-04T09:17:11.6388280Z * [new branch] gh/wconstab/464/orig -> origin/gh/wconstab/464/orig 2025-12-04T09:17:11.6390687Z * [new branch] gh/wconstab/465/base -> origin/gh/wconstab/465/base 2025-12-04T09:17:11.6392603Z * [new branch] gh/wconstab/465/head -> origin/gh/wconstab/465/head 2025-12-04T09:17:11.6394273Z * [new branch] gh/wconstab/465/orig -> origin/gh/wconstab/465/orig 2025-12-04T09:17:11.6397063Z * [new branch] gh/wconstab/466/base -> origin/gh/wconstab/466/base 2025-12-04T09:17:11.6398890Z * [new branch] gh/wconstab/466/head -> origin/gh/wconstab/466/head 2025-12-04T09:17:11.6401276Z * [new branch] gh/wconstab/466/orig -> origin/gh/wconstab/466/orig 2025-12-04T09:17:11.6404293Z * [new branch] gh/wconstab/467/base -> origin/gh/wconstab/467/base 2025-12-04T09:17:11.6406126Z * [new branch] gh/wconstab/467/head -> origin/gh/wconstab/467/head 2025-12-04T09:17:11.6408030Z * [new branch] gh/wconstab/467/orig -> origin/gh/wconstab/467/orig 2025-12-04T09:17:11.6410561Z * [new branch] gh/wconstab/468/base -> origin/gh/wconstab/468/base 2025-12-04T09:17:11.6412361Z * [new branch] gh/wconstab/468/head -> origin/gh/wconstab/468/head 2025-12-04T09:17:11.6414168Z * [new branch] 
gh/wconstab/468/orig -> origin/gh/wconstab/468/orig 2025-12-04T09:17:11.6417401Z * [new branch] gh/weifengpy/39/base -> origin/gh/weifengpy/39/base 2025-12-04T09:17:11.6419153Z * [new branch] gh/weifengpy/39/head -> origin/gh/weifengpy/39/head 2025-12-04T09:17:11.6421191Z * [new branch] gh/weifengpy/39/orig -> origin/gh/weifengpy/39/orig 2025-12-04T09:17:11.6423811Z * [new branch] gh/weifengpy/40/base -> origin/gh/weifengpy/40/base 2025-12-04T09:17:11.6425668Z * [new branch] gh/weifengpy/40/head -> origin/gh/weifengpy/40/head 2025-12-04T09:17:11.6427465Z * [new branch] gh/weifengpy/40/orig -> origin/gh/weifengpy/40/orig 2025-12-04T09:17:11.6430105Z * [new branch] gh/weifengpy/41/base -> origin/gh/weifengpy/41/base 2025-12-04T09:17:11.6432020Z * [new branch] gh/weifengpy/41/head -> origin/gh/weifengpy/41/head 2025-12-04T09:17:11.6433983Z * [new branch] gh/weifengpy/41/orig -> origin/gh/weifengpy/41/orig 2025-12-04T09:17:11.6437167Z * [new branch] gh/williamwen42/250/base -> origin/gh/williamwen42/250/base 2025-12-04T09:17:11.6439022Z * [new branch] gh/williamwen42/250/head -> origin/gh/williamwen42/250/head 2025-12-04T09:17:11.6441005Z * [new branch] gh/williamwen42/250/orig -> origin/gh/williamwen42/250/orig 2025-12-04T09:17:11.6443566Z * [new branch] gh/williamwen42/279/base -> origin/gh/williamwen42/279/base 2025-12-04T09:17:11.6445686Z * [new branch] gh/williamwen42/279/head -> origin/gh/williamwen42/279/head 2025-12-04T09:17:11.6447518Z * [new branch] gh/williamwen42/279/orig -> origin/gh/williamwen42/279/orig 2025-12-04T09:17:11.6450123Z * [new branch] gh/williamwen42/282/base -> origin/gh/williamwen42/282/base 2025-12-04T09:17:11.6451965Z * [new branch] gh/williamwen42/282/head -> origin/gh/williamwen42/282/head 2025-12-04T09:17:11.6453797Z * [new branch] gh/williamwen42/282/orig -> origin/gh/williamwen42/282/orig 2025-12-04T09:17:11.6456333Z * [new branch] gh/williamwen42/287/base -> origin/gh/williamwen42/287/base 2025-12-04T09:17:11.6458323Z * [new branch] 
gh/williamwen42/287/head -> origin/gh/williamwen42/287/head 2025-12-04T09:17:11.6460069Z * [new branch] gh/williamwen42/287/orig -> origin/gh/williamwen42/287/orig 2025-12-04T09:17:11.6462754Z * [new branch] gh/williamwen42/288/base -> origin/gh/williamwen42/288/base 2025-12-04T09:17:11.6464407Z * [new branch] gh/williamwen42/288/head -> origin/gh/williamwen42/288/head 2025-12-04T09:17:11.6466208Z * [new branch] gh/williamwen42/288/orig -> origin/gh/williamwen42/288/orig 2025-12-04T09:17:11.6468962Z * [new branch] gh/williamwen42/296/base -> origin/gh/williamwen42/296/base 2025-12-04T09:17:11.6471005Z * [new branch] gh/williamwen42/296/head -> origin/gh/williamwen42/296/head 2025-12-04T09:17:11.6472892Z * [new branch] gh/williamwen42/296/orig -> origin/gh/williamwen42/296/orig 2025-12-04T09:17:11.6475348Z * [new branch] gh/williamwen42/297/base -> origin/gh/williamwen42/297/base 2025-12-04T09:17:11.6477204Z * [new branch] gh/williamwen42/297/head -> origin/gh/williamwen42/297/head 2025-12-04T09:17:11.6481037Z * [new branch] gh/williamwen42/297/orig -> origin/gh/williamwen42/297/orig 2025-12-04T09:17:11.6482853Z * [new branch] gh/williamwen42/306/base -> origin/gh/williamwen42/306/base 2025-12-04T09:17:11.6483452Z * [new branch] gh/williamwen42/306/head -> origin/gh/williamwen42/306/head 2025-12-04T09:17:11.6485612Z * [new branch] gh/williamwen42/306/orig -> origin/gh/williamwen42/306/orig 2025-12-04T09:17:11.6488037Z * [new branch] gh/williamwen42/309/base -> origin/gh/williamwen42/309/base 2025-12-04T09:17:11.6490002Z * [new branch] gh/williamwen42/309/head -> origin/gh/williamwen42/309/head 2025-12-04T09:17:11.6491832Z * [new branch] gh/williamwen42/309/orig -> origin/gh/williamwen42/309/orig 2025-12-04T09:17:11.6494305Z * [new branch] gh/williamwen42/310/base -> origin/gh/williamwen42/310/base 2025-12-04T09:17:11.6496204Z * [new branch] gh/williamwen42/310/head -> origin/gh/williamwen42/310/head 2025-12-04T09:17:11.6498150Z * [new branch] 
gh/williamwen42/310/orig -> origin/gh/williamwen42/310/orig 2025-12-04T09:17:11.6504378Z * [new branch] gh/williamwen42/311/base -> origin/gh/williamwen42/311/base 2025-12-04T09:17:11.6506222Z * [new branch] gh/williamwen42/311/head -> origin/gh/williamwen42/311/head 2025-12-04T09:17:11.6508004Z * [new branch] gh/williamwen42/311/orig -> origin/gh/williamwen42/311/orig 2025-12-04T09:17:11.6510393Z * [new branch] gh/williamwen42/319/base -> origin/gh/williamwen42/319/base 2025-12-04T09:17:11.6512216Z * [new branch] gh/williamwen42/319/head -> origin/gh/williamwen42/319/head 2025-12-04T09:17:11.6514013Z * [new branch] gh/williamwen42/319/orig -> origin/gh/williamwen42/319/orig 2025-12-04T09:17:11.6516708Z * [new branch] gh/williamwen42/325/base -> origin/gh/williamwen42/325/base 2025-12-04T09:17:11.6518695Z * [new branch] gh/williamwen42/325/head -> origin/gh/williamwen42/325/head 2025-12-04T09:17:11.6520588Z * [new branch] gh/williamwen42/325/orig -> origin/gh/williamwen42/325/orig 2025-12-04T09:17:11.6523091Z * [new branch] gh/williamwen42/326/base -> origin/gh/williamwen42/326/base 2025-12-04T09:17:11.6525186Z * [new branch] gh/williamwen42/326/head -> origin/gh/williamwen42/326/head 2025-12-04T09:17:11.6527028Z * [new branch] gh/williamwen42/326/orig -> origin/gh/williamwen42/326/orig 2025-12-04T09:17:11.6529562Z * [new branch] gh/williamwen42/327/base -> origin/gh/williamwen42/327/base 2025-12-04T09:17:11.6531374Z * [new branch] gh/williamwen42/327/head -> origin/gh/williamwen42/327/head 2025-12-04T09:17:11.6533171Z * [new branch] gh/williamwen42/327/orig -> origin/gh/williamwen42/327/orig 2025-12-04T09:17:11.6535761Z * [new branch] gh/williamwen42/328/base -> origin/gh/williamwen42/328/base 2025-12-04T09:17:11.6537857Z * [new branch] gh/williamwen42/328/head -> origin/gh/williamwen42/328/head 2025-12-04T09:17:11.6539492Z * [new branch] gh/williamwen42/328/orig -> origin/gh/williamwen42/328/orig 2025-12-04T09:17:11.6543277Z * [new branch] 
gh/williamwen42/329/base -> origin/gh/williamwen42/329/base 2025-12-04T09:17:11.6545208Z * [new branch] gh/williamwen42/329/head -> origin/gh/williamwen42/329/head 2025-12-04T09:17:11.6547059Z * [new branch] gh/williamwen42/329/orig -> origin/gh/williamwen42/329/orig 2025-12-04T09:17:11.6549720Z * [new branch] gh/williamwen42/330/base -> origin/gh/williamwen42/330/base 2025-12-04T09:17:11.6551719Z * [new branch] gh/williamwen42/330/head -> origin/gh/williamwen42/330/head 2025-12-04T09:17:11.6553547Z * [new branch] gh/williamwen42/330/orig -> origin/gh/williamwen42/330/orig 2025-12-04T09:17:11.6556141Z * [new branch] gh/williamwen42/331/base -> origin/gh/williamwen42/331/base 2025-12-04T09:17:11.6558037Z * [new branch] gh/williamwen42/331/head -> origin/gh/williamwen42/331/head 2025-12-04T09:17:11.6560044Z * [new branch] gh/williamwen42/331/orig -> origin/gh/williamwen42/331/orig 2025-12-04T09:17:11.6562500Z * [new branch] gh/williamwen42/332/base -> origin/gh/williamwen42/332/base 2025-12-04T09:17:11.6564317Z * [new branch] gh/williamwen42/332/head -> origin/gh/williamwen42/332/head 2025-12-04T09:17:11.6566198Z * [new branch] gh/williamwen42/332/orig -> origin/gh/williamwen42/332/orig 2025-12-04T09:17:11.6569105Z * [new branch] gh/williamwen42/333/base -> origin/gh/williamwen42/333/base 2025-12-04T09:17:11.6570881Z * [new branch] gh/williamwen42/333/head -> origin/gh/williamwen42/333/head 2025-12-04T09:17:11.6572741Z * [new branch] gh/williamwen42/333/orig -> origin/gh/williamwen42/333/orig 2025-12-04T09:17:11.6575327Z * [new branch] gh/williamwen42/334/base -> origin/gh/williamwen42/334/base 2025-12-04T09:17:11.6577285Z * [new branch] gh/williamwen42/334/head -> origin/gh/williamwen42/334/head 2025-12-04T09:17:11.6579164Z * [new branch] gh/williamwen42/334/orig -> origin/gh/williamwen42/334/orig 2025-12-04T09:17:11.6585027Z * [new branch] gh/williamwen42/335/base -> origin/gh/williamwen42/335/base 2025-12-04T09:17:11.6586939Z * [new branch] 
gh/williamwen42/335/head -> origin/gh/williamwen42/335/head 2025-12-04T09:17:11.6588821Z * [new branch] gh/williamwen42/335/orig -> origin/gh/williamwen42/335/orig 2025-12-04T09:17:11.6591416Z * [new branch] gh/williamwen42/336/base -> origin/gh/williamwen42/336/base 2025-12-04T09:17:11.6593198Z * [new branch] gh/williamwen42/336/head -> origin/gh/williamwen42/336/head 2025-12-04T09:17:11.6594991Z * [new branch] gh/williamwen42/336/orig -> origin/gh/williamwen42/336/orig 2025-12-04T09:17:11.6597586Z * [new branch] gh/williamwen42/337/base -> origin/gh/williamwen42/337/base 2025-12-04T09:17:11.6599514Z * [new branch] gh/williamwen42/337/head -> origin/gh/williamwen42/337/head 2025-12-04T09:17:11.6601725Z * [new branch] gh/williamwen42/337/orig -> origin/gh/williamwen42/337/orig 2025-12-04T09:17:11.6604797Z * [new branch] gh/williamwen42/338/base -> origin/gh/williamwen42/338/base 2025-12-04T09:17:11.6606757Z * [new branch] gh/williamwen42/338/head -> origin/gh/williamwen42/338/head 2025-12-04T09:17:11.6608567Z * [new branch] gh/williamwen42/338/orig -> origin/gh/williamwen42/338/orig 2025-12-04T09:17:11.6611155Z * [new branch] gh/williamwen42/339/base -> origin/gh/williamwen42/339/base 2025-12-04T09:17:11.6613148Z * [new branch] gh/williamwen42/339/head -> origin/gh/williamwen42/339/head 2025-12-04T09:17:11.6614811Z * [new branch] gh/williamwen42/339/orig -> origin/gh/williamwen42/339/orig 2025-12-04T09:17:11.6617511Z * [new branch] gh/williamwen42/340/base -> origin/gh/williamwen42/340/base 2025-12-04T09:17:11.6619336Z * [new branch] gh/williamwen42/340/head -> origin/gh/williamwen42/340/head 2025-12-04T09:17:11.6621622Z * [new branch] gh/williamwen42/340/orig -> origin/gh/williamwen42/340/orig 2025-12-04T09:17:11.6624828Z * [new branch] gh/williamwen42/341/base -> origin/gh/williamwen42/341/base 2025-12-04T09:17:11.6626943Z * [new branch] gh/williamwen42/341/head -> origin/gh/williamwen42/341/head 2025-12-04T09:17:11.6628687Z * [new branch] 
gh/williamwen42/341/orig -> origin/gh/williamwen42/341/orig 2025-12-04T09:17:11.6631204Z * [new branch] gh/williamwen42/342/base -> origin/gh/williamwen42/342/base 2025-12-04T09:17:11.6633192Z * [new branch] gh/williamwen42/342/head -> origin/gh/williamwen42/342/head 2025-12-04T09:17:11.6635027Z * [new branch] gh/williamwen42/342/orig -> origin/gh/williamwen42/342/orig 2025-12-04T09:17:11.6637697Z * [new branch] gh/williamwen42/343/base -> origin/gh/williamwen42/343/base 2025-12-04T09:17:11.6639606Z * [new branch] gh/williamwen42/343/head -> origin/gh/williamwen42/343/head 2025-12-04T09:17:11.6641507Z * [new branch] gh/williamwen42/343/orig -> origin/gh/williamwen42/343/orig 2025-12-04T09:17:11.6644089Z * [new branch] gh/williamwen42/344/base -> origin/gh/williamwen42/344/base 2025-12-04T09:17:11.6645896Z * [new branch] gh/williamwen42/344/head -> origin/gh/williamwen42/344/head 2025-12-04T09:17:11.6647717Z * [new branch] gh/williamwen42/344/orig -> origin/gh/williamwen42/344/orig 2025-12-04T09:17:11.6650422Z * [new branch] gh/williamwen42/345/base -> origin/gh/williamwen42/345/base 2025-12-04T09:17:11.6652764Z * [new branch] gh/williamwen42/345/head -> origin/gh/williamwen42/345/head 2025-12-04T09:17:11.6654579Z * [new branch] gh/williamwen42/345/orig -> origin/gh/williamwen42/345/orig 2025-12-04T09:17:11.6657147Z * [new branch] gh/williamwen42/346/base -> origin/gh/williamwen42/346/base 2025-12-04T09:17:11.6659230Z * [new branch] gh/williamwen42/346/head -> origin/gh/williamwen42/346/head 2025-12-04T09:17:11.6661103Z * [new branch] gh/williamwen42/346/orig -> origin/gh/williamwen42/346/orig 2025-12-04T09:17:11.6663777Z * [new branch] gh/williamwen42/347/base -> origin/gh/williamwen42/347/base 2025-12-04T09:17:11.6665580Z * [new branch] gh/williamwen42/347/head -> origin/gh/williamwen42/347/head 2025-12-04T09:17:11.6667347Z * [new branch] gh/williamwen42/347/orig -> origin/gh/williamwen42/347/orig 2025-12-04T09:17:11.6669835Z * [new branch] 
gh/williamwen42/348/base -> origin/gh/williamwen42/348/base 2025-12-04T09:17:11.6671571Z * [new branch] gh/williamwen42/348/head -> origin/gh/williamwen42/348/head 2025-12-04T09:17:11.6673386Z * [new branch] gh/williamwen42/348/orig -> origin/gh/williamwen42/348/orig 2025-12-04T09:17:11.6676255Z * [new branch] gh/williamwen42/349/base -> origin/gh/williamwen42/349/base 2025-12-04T09:17:11.6678120Z * [new branch] gh/williamwen42/349/head -> origin/gh/williamwen42/349/head 2025-12-04T09:17:11.6680119Z * [new branch] gh/williamwen42/349/orig -> origin/gh/williamwen42/349/orig 2025-12-04T09:17:11.6682864Z * [new branch] gh/williamwen42/350/base -> origin/gh/williamwen42/350/base 2025-12-04T09:17:11.6684726Z * [new branch] gh/williamwen42/350/head -> origin/gh/williamwen42/350/head 2025-12-04T09:17:11.6686729Z * [new branch] gh/williamwen42/350/orig -> origin/gh/williamwen42/350/orig 2025-12-04T09:17:11.6689307Z * [new branch] gh/williamwen42/351/base -> origin/gh/williamwen42/351/base 2025-12-04T09:17:11.6691310Z * [new branch] gh/williamwen42/351/head -> origin/gh/williamwen42/351/head 2025-12-04T09:17:11.6693168Z * [new branch] gh/williamwen42/351/orig -> origin/gh/williamwen42/351/orig 2025-12-04T09:17:11.6695748Z * [new branch] gh/williamwen42/352/base -> origin/gh/williamwen42/352/base 2025-12-04T09:17:11.6697588Z * [new branch] gh/williamwen42/352/head -> origin/gh/williamwen42/352/head 2025-12-04T09:17:11.6699521Z * [new branch] gh/williamwen42/352/orig -> origin/gh/williamwen42/352/orig 2025-12-04T09:17:11.6704403Z * [new branch] gh/williamwen42/353/base -> origin/gh/williamwen42/353/base 2025-12-04T09:17:11.6706254Z * [new branch] gh/williamwen42/353/head -> origin/gh/williamwen42/353/head 2025-12-04T09:17:11.6708103Z * [new branch] gh/williamwen42/353/orig -> origin/gh/williamwen42/353/orig 2025-12-04T09:17:11.6710658Z * [new branch] gh/williamwen42/354/base -> origin/gh/williamwen42/354/base 2025-12-04T09:17:11.6712707Z * [new branch] 
gh/williamwen42/354/head -> origin/gh/williamwen42/354/head 2025-12-04T09:17:11.6714512Z * [new branch] gh/williamwen42/354/orig -> origin/gh/williamwen42/354/orig 2025-12-04T09:17:11.6717063Z * [new branch] gh/williamwen42/355/base -> origin/gh/williamwen42/355/base 2025-12-04T09:17:11.6718934Z * [new branch] gh/williamwen42/355/head -> origin/gh/williamwen42/355/head 2025-12-04T09:17:11.6720868Z * [new branch] gh/williamwen42/355/orig -> origin/gh/williamwen42/355/orig 2025-12-04T09:17:11.6723466Z * [new branch] gh/williamwen42/356/base -> origin/gh/williamwen42/356/base 2025-12-04T09:17:11.6725308Z * [new branch] gh/williamwen42/356/head -> origin/gh/williamwen42/356/head 2025-12-04T09:17:11.6727200Z * [new branch] gh/williamwen42/356/orig -> origin/gh/williamwen42/356/orig 2025-12-04T09:17:11.6729746Z * [new branch] gh/williamwen42/357/base -> origin/gh/williamwen42/357/base 2025-12-04T09:17:11.6731661Z * [new branch] gh/williamwen42/357/head -> origin/gh/williamwen42/357/head 2025-12-04T09:17:11.6733489Z * [new branch] gh/williamwen42/357/orig -> origin/gh/williamwen42/357/orig 2025-12-04T09:17:11.6736215Z * [new branch] gh/williamwen42/358/base -> origin/gh/williamwen42/358/base 2025-12-04T09:17:11.6738143Z * [new branch] gh/williamwen42/358/head -> origin/gh/williamwen42/358/head 2025-12-04T09:17:11.6740260Z * [new branch] gh/williamwen42/358/orig -> origin/gh/williamwen42/358/orig 2025-12-04T09:17:11.6743289Z * [new branch] gh/xmfan/169/base -> origin/gh/xmfan/169/base 2025-12-04T09:17:11.6745131Z * [new branch] gh/xmfan/169/head -> origin/gh/xmfan/169/head 2025-12-04T09:17:11.6747990Z * [new branch] gh/xmfan/170/base -> origin/gh/xmfan/170/base 2025-12-04T09:17:11.6749769Z * [new branch] gh/xmfan/170/head -> origin/gh/xmfan/170/head 2025-12-04T09:17:11.6752252Z * [new branch] gh/xmfan/274/base -> origin/gh/xmfan/274/base 2025-12-04T09:17:11.6754093Z * [new branch] gh/xmfan/274/head -> origin/gh/xmfan/274/head 2025-12-04T09:17:11.6755907Z * [new branch] 
gh/xmfan/274/orig -> origin/gh/xmfan/274/orig 2025-12-04T09:17:11.6758367Z * [new branch] gh/xmfan/277/base -> origin/gh/xmfan/277/base 2025-12-04T09:17:11.6760463Z * [new branch] gh/xmfan/277/head -> origin/gh/xmfan/277/head 2025-12-04T09:17:11.6762428Z * [new branch] gh/xmfan/277/orig -> origin/gh/xmfan/277/orig 2025-12-04T09:17:11.6765027Z * [new branch] gh/xmfan/301/base -> origin/gh/xmfan/301/base 2025-12-04T09:17:11.6766757Z * [new branch] gh/xmfan/301/head -> origin/gh/xmfan/301/head 2025-12-04T09:17:11.6768668Z * [new branch] gh/xmfan/301/orig -> origin/gh/xmfan/301/orig 2025-12-04T09:17:11.6771013Z * [new branch] gh/xmfan/304/base -> origin/gh/xmfan/304/base 2025-12-04T09:17:11.6772788Z * [new branch] gh/xmfan/304/head -> origin/gh/xmfan/304/head 2025-12-04T09:17:11.6774622Z * [new branch] gh/xmfan/304/orig -> origin/gh/xmfan/304/orig 2025-12-04T09:17:11.6777144Z * [new branch] gh/xmfan/309/base -> origin/gh/xmfan/309/base 2025-12-04T09:17:11.6778973Z * [new branch] gh/xmfan/309/head -> origin/gh/xmfan/309/head 2025-12-04T09:17:11.6780765Z * [new branch] gh/xmfan/309/orig -> origin/gh/xmfan/309/orig 2025-12-04T09:17:11.6783244Z * [new branch] gh/xmfan/310/base -> origin/gh/xmfan/310/base 2025-12-04T09:17:11.6790124Z * [new branch] gh/xmfan/310/head -> origin/gh/xmfan/310/head 2025-12-04T09:17:11.6790538Z * [new branch] gh/xmfan/310/orig -> origin/gh/xmfan/310/orig 2025-12-04T09:17:11.6790751Z * [new branch] gh/xmfan/311/base -> origin/gh/xmfan/311/base 2025-12-04T09:17:11.6791077Z * [new branch] gh/xmfan/311/head -> origin/gh/xmfan/311/head 2025-12-04T09:17:11.6793095Z * [new branch] gh/xmfan/311/orig -> origin/gh/xmfan/311/orig 2025-12-04T09:17:11.6795548Z * [new branch] gh/xmfan/312/base -> origin/gh/xmfan/312/base 2025-12-04T09:17:11.6798028Z * [new branch] gh/xmfan/312/head -> origin/gh/xmfan/312/head 2025-12-04T09:17:11.6799812Z * [new branch] gh/xmfan/312/orig -> origin/gh/xmfan/312/orig 2025-12-04T09:17:11.6802594Z * [new branch] gh/xmfan/313/base 
-> origin/gh/xmfan/313/base 2025-12-04T09:17:11.6804389Z * [new branch] gh/xmfan/313/head -> origin/gh/xmfan/313/head 2025-12-04T09:17:11.6806207Z * [new branch] gh/xmfan/313/orig -> origin/gh/xmfan/313/orig 2025-12-04T09:17:11.6809337Z * [new branch] gh/xuanzhang816/27/base -> origin/gh/xuanzhang816/27/base 2025-12-04T09:17:11.6811265Z * [new branch] gh/xuanzhang816/27/head -> origin/gh/xuanzhang816/27/head 2025-12-04T09:17:11.6813051Z * [new branch] gh/xuanzhang816/27/orig -> origin/gh/xuanzhang816/27/orig 2025-12-04T09:17:11.6815648Z * [new branch] gh/xuanzhang816/32/base -> origin/gh/xuanzhang816/32/base 2025-12-04T09:17:11.6817488Z * [new branch] gh/xuanzhang816/32/head -> origin/gh/xuanzhang816/32/head 2025-12-04T09:17:11.6819327Z * [new branch] gh/xuanzhang816/32/orig -> origin/gh/xuanzhang816/32/orig 2025-12-04T09:17:11.6821904Z * [new branch] gh/xuanzhang816/33/base -> origin/gh/xuanzhang816/33/base 2025-12-04T09:17:11.6823695Z * [new branch] gh/xuanzhang816/33/head -> origin/gh/xuanzhang816/33/head 2025-12-04T09:17:11.6825498Z * [new branch] gh/xuanzhang816/33/orig -> origin/gh/xuanzhang816/33/orig 2025-12-04T09:17:11.6828478Z * [new branch] gh/xuanzhang816/34/base -> origin/gh/xuanzhang816/34/base 2025-12-04T09:17:11.6830281Z * [new branch] gh/xuanzhang816/34/head -> origin/gh/xuanzhang816/34/head 2025-12-04T09:17:11.6832146Z * [new branch] gh/xuanzhang816/34/orig -> origin/gh/xuanzhang816/34/orig 2025-12-04T09:17:11.6834872Z * [new branch] gh/xuanzhang816/35/base -> origin/gh/xuanzhang816/35/base 2025-12-04T09:17:11.6836764Z * [new branch] gh/xuanzhang816/35/head -> origin/gh/xuanzhang816/35/head 2025-12-04T09:17:11.6838811Z * [new branch] gh/xuanzhang816/35/orig -> origin/gh/xuanzhang816/35/orig 2025-12-04T09:17:11.6842056Z * [new branch] gh/yanbing-j/11/base -> origin/gh/yanbing-j/11/base 2025-12-04T09:17:11.6843851Z * [new branch] gh/yanbing-j/11/head -> origin/gh/yanbing-j/11/head 2025-12-04T09:17:11.6845669Z * [new branch] gh/yanbing-j/11/orig -> 
origin/gh/yanbing-j/11/orig 2025-12-04T09:17:11.6848179Z * [new branch] gh/yanbing-j/12/base -> origin/gh/yanbing-j/12/base 2025-12-04T09:17:11.6849959Z * [new branch] gh/yanbing-j/12/head -> origin/gh/yanbing-j/12/head 2025-12-04T09:17:11.6851755Z * [new branch] gh/yanbing-j/12/orig -> origin/gh/yanbing-j/12/orig 2025-12-04T09:17:11.6854271Z * [new branch] gh/yanbing-j/13/base -> origin/gh/yanbing-j/13/base 2025-12-04T09:17:11.6856132Z * [new branch] gh/yanbing-j/13/head -> origin/gh/yanbing-j/13/head 2025-12-04T09:17:11.6857957Z * [new branch] gh/yanbing-j/13/orig -> origin/gh/yanbing-j/13/orig 2025-12-04T09:17:11.6860603Z * [new branch] gh/yanbing-j/14/base -> origin/gh/yanbing-j/14/base 2025-12-04T09:17:11.6862411Z * [new branch] gh/yanbing-j/14/head -> origin/gh/yanbing-j/14/head 2025-12-04T09:17:11.6864225Z * [new branch] gh/yanbing-j/14/orig -> origin/gh/yanbing-j/14/orig 2025-12-04T09:17:11.6866684Z * [new branch] gh/yanbing-j/15/base -> origin/gh/yanbing-j/15/base 2025-12-04T09:17:11.6868605Z * [new branch] gh/yanbing-j/15/head -> origin/gh/yanbing-j/15/head 2025-12-04T09:17:11.6870356Z * [new branch] gh/yanbing-j/15/orig -> origin/gh/yanbing-j/15/orig 2025-12-04T09:17:11.6872757Z * [new branch] gh/yanbing-j/18/base -> origin/gh/yanbing-j/18/base 2025-12-04T09:17:11.6874612Z * [new branch] gh/yanbing-j/18/head -> origin/gh/yanbing-j/18/head 2025-12-04T09:17:11.6876400Z * [new branch] gh/yanbing-j/18/orig -> origin/gh/yanbing-j/18/orig 2025-12-04T09:17:11.6879044Z * [new branch] gh/yanbing-j/19/base -> origin/gh/yanbing-j/19/base 2025-12-04T09:17:11.6880917Z * [new branch] gh/yanbing-j/19/head -> origin/gh/yanbing-j/19/head 2025-12-04T09:17:11.6883188Z * [new branch] gh/yanbing-j/19/orig -> origin/gh/yanbing-j/19/orig 2025-12-04T09:17:11.6885753Z * [new branch] gh/yanbing-j/20/base -> origin/gh/yanbing-j/20/base 2025-12-04T09:17:11.6887862Z * [new branch] gh/yanbing-j/20/head -> origin/gh/yanbing-j/20/head 2025-12-04T09:17:11.6889517Z * [new branch] 
gh/yanbing-j/20/orig -> origin/gh/yanbing-j/20/orig 2025-12-04T09:17:11.6892031Z * [new branch] gh/yanbing-j/21/base -> origin/gh/yanbing-j/21/base 2025-12-04T09:17:11.6893945Z * [new branch] gh/yanbing-j/21/head -> origin/gh/yanbing-j/21/head 2025-12-04T09:17:11.6896393Z * [new branch] gh/yanbing-j/22/base -> origin/gh/yanbing-j/22/base 2025-12-04T09:17:11.6898211Z * [new branch] gh/yanbing-j/22/head -> origin/gh/yanbing-j/22/head 2025-12-04T09:17:11.6900127Z * [new branch] gh/yanbing-j/22/orig -> origin/gh/yanbing-j/22/orig 2025-12-04T09:17:11.6903294Z * [new branch] gh/yanbing-j/23/base -> origin/gh/yanbing-j/23/base 2025-12-04T09:17:11.6904860Z * [new branch] gh/yanbing-j/23/head -> origin/gh/yanbing-j/23/head 2025-12-04T09:17:11.6906675Z * [new branch] gh/yanbing-j/23/orig -> origin/gh/yanbing-j/23/orig 2025-12-04T09:17:11.6909180Z * [new branch] gh/yanbing-j/24/base -> origin/gh/yanbing-j/24/base 2025-12-04T09:17:11.6911143Z * [new branch] gh/yanbing-j/24/head -> origin/gh/yanbing-j/24/head 2025-12-04T09:17:11.6913102Z * [new branch] gh/yanbing-j/24/orig -> origin/gh/yanbing-j/24/orig 2025-12-04T09:17:11.6915501Z * [new branch] gh/yanbing-j/25/base -> origin/gh/yanbing-j/25/base 2025-12-04T09:17:11.6917294Z * [new branch] gh/yanbing-j/25/head -> origin/gh/yanbing-j/25/head 2025-12-04T09:17:11.6919280Z * [new branch] gh/yanbing-j/25/orig -> origin/gh/yanbing-j/25/orig 2025-12-04T09:17:11.6921785Z * [new branch] gh/yanbing-j/26/base -> origin/gh/yanbing-j/26/base 2025-12-04T09:17:11.6923603Z * [new branch] gh/yanbing-j/26/head -> origin/gh/yanbing-j/26/head 2025-12-04T09:17:11.6925385Z * [new branch] gh/yanbing-j/26/orig -> origin/gh/yanbing-j/26/orig 2025-12-04T09:17:11.6929106Z * [new branch] gh/yang-yu-hang/1/base -> origin/gh/yang-yu-hang/1/base 2025-12-04T09:17:11.6930895Z * [new branch] gh/yang-yu-hang/1/head -> origin/gh/yang-yu-hang/1/head 2025-12-04T09:17:11.6932820Z * [new branch] gh/yang-yu-hang/1/orig -> origin/gh/yang-yu-hang/1/orig 
2025-12-04T09:17:11.6935424Z * [new branch] gh/yang-yu-hang/2/base -> origin/gh/yang-yu-hang/2/base 2025-12-04T09:17:11.6937469Z * [new branch] gh/yang-yu-hang/2/head -> origin/gh/yang-yu-hang/2/head 2025-12-04T09:17:11.6939531Z * [new branch] gh/yang-yu-hang/2/orig -> origin/gh/yang-yu-hang/2/orig 2025-12-04T09:17:11.6942027Z * [new branch] gh/yang-yu-hang/3/base -> origin/gh/yang-yu-hang/3/base 2025-12-04T09:17:11.6943926Z * [new branch] gh/yang-yu-hang/3/head -> origin/gh/yang-yu-hang/3/head 2025-12-04T09:17:11.6945882Z * [new branch] gh/yang-yu-hang/3/orig -> origin/gh/yang-yu-hang/3/orig 2025-12-04T09:17:11.6949058Z * [new branch] gh/yangw-dev/12/base -> origin/gh/yangw-dev/12/base 2025-12-04T09:17:11.6950890Z * [new branch] gh/yangw-dev/12/head -> origin/gh/yangw-dev/12/head 2025-12-04T09:17:11.6952708Z * [new branch] gh/yangw-dev/12/orig -> origin/gh/yangw-dev/12/orig 2025-12-04T09:17:11.6955165Z * [new branch] gh/yangw-dev/13/base -> origin/gh/yangw-dev/13/base 2025-12-04T09:17:11.6957048Z * [new branch] gh/yangw-dev/13/head -> origin/gh/yangw-dev/13/head 2025-12-04T09:17:11.6958913Z * [new branch] gh/yangw-dev/13/orig -> origin/gh/yangw-dev/13/orig 2025-12-04T09:17:11.6961738Z * [new branch] gh/yangw-dev/14/base -> origin/gh/yangw-dev/14/base 2025-12-04T09:17:11.6963580Z * [new branch] gh/yangw-dev/14/head -> origin/gh/yangw-dev/14/head 2025-12-04T09:17:11.6965341Z * [new branch] gh/yangw-dev/14/orig -> origin/gh/yangw-dev/14/orig 2025-12-04T09:17:11.6967828Z * [new branch] gh/yangw-dev/15/base -> origin/gh/yangw-dev/15/base 2025-12-04T09:17:11.6969638Z * [new branch] gh/yangw-dev/15/head -> origin/gh/yangw-dev/15/head 2025-12-04T09:17:11.6971434Z * [new branch] gh/yangw-dev/15/orig -> origin/gh/yangw-dev/15/orig 2025-12-04T09:17:11.6973884Z * [new branch] gh/yangw-dev/19/base -> origin/gh/yangw-dev/19/base 2025-12-04T09:17:11.6975694Z * [new branch] gh/yangw-dev/19/head -> origin/gh/yangw-dev/19/head 2025-12-04T09:17:11.6977529Z * [new branch] 
gh/yangw-dev/19/orig -> origin/gh/yangw-dev/19/orig 2025-12-04T09:17:11.6980118Z * [new branch] gh/yangw-dev/26/base -> origin/gh/yangw-dev/26/base 2025-12-04T09:17:11.6981899Z * [new branch] gh/yangw-dev/26/head -> origin/gh/yangw-dev/26/head 2025-12-04T09:17:11.6983698Z * [new branch] gh/yangw-dev/26/orig -> origin/gh/yangw-dev/26/orig 2025-12-04T09:17:11.6986301Z * [new branch] gh/yangw-dev/27/base -> origin/gh/yangw-dev/27/base 2025-12-04T09:17:11.6988236Z * [new branch] gh/yangw-dev/27/head -> origin/gh/yangw-dev/27/head 2025-12-04T09:17:11.6989971Z * [new branch] gh/yangw-dev/27/orig -> origin/gh/yangw-dev/27/orig 2025-12-04T09:17:11.6993123Z * [new branch] gh/ydwu4/292/base -> origin/gh/ydwu4/292/base 2025-12-04T09:17:11.6994908Z * [new branch] gh/ydwu4/292/head -> origin/gh/ydwu4/292/head 2025-12-04T09:17:11.6996687Z * [new branch] gh/ydwu4/292/orig -> origin/gh/ydwu4/292/orig 2025-12-04T09:17:11.6999208Z * [new branch] gh/ydwu4/294/base -> origin/gh/ydwu4/294/base 2025-12-04T09:17:11.7001386Z * [new branch] gh/ydwu4/294/head -> origin/gh/ydwu4/294/head 2025-12-04T09:17:11.7005044Z * [new branch] gh/ydwu4/294/orig -> origin/gh/ydwu4/294/orig 2025-12-04T09:17:11.7007790Z * [new branch] gh/ydwu4/295/base -> origin/gh/ydwu4/295/base 2025-12-04T09:17:11.7009729Z * [new branch] gh/ydwu4/295/head -> origin/gh/ydwu4/295/head 2025-12-04T09:17:11.7011766Z * [new branch] gh/ydwu4/295/orig -> origin/gh/ydwu4/295/orig 2025-12-04T09:17:11.7014169Z * [new branch] gh/ydwu4/296/base -> origin/gh/ydwu4/296/base 2025-12-04T09:17:11.7015891Z * [new branch] gh/ydwu4/296/head -> origin/gh/ydwu4/296/head 2025-12-04T09:17:11.7017718Z * [new branch] gh/ydwu4/296/orig -> origin/gh/ydwu4/296/orig 2025-12-04T09:17:11.7020295Z * [new branch] gh/ydwu4/306/base -> origin/gh/ydwu4/306/base 2025-12-04T09:17:11.7022209Z * [new branch] gh/ydwu4/306/head -> origin/gh/ydwu4/306/head 2025-12-04T09:17:11.7024039Z * [new branch] gh/ydwu4/306/orig -> origin/gh/ydwu4/306/orig 
2025-12-04T09:17:11.7026992Z * [new branch] gh/ydwu4/312/base -> origin/gh/ydwu4/312/base 2025-12-04T09:17:11.7028818Z * [new branch] gh/ydwu4/312/head -> origin/gh/ydwu4/312/head 2025-12-04T09:17:11.7031186Z * [new branch] gh/ydwu4/312/orig -> origin/gh/ydwu4/312/orig 2025-12-04T09:17:11.7033647Z * [new branch] gh/ydwu4/322/base -> origin/gh/ydwu4/322/base 2025-12-04T09:17:11.7035489Z * [new branch] gh/ydwu4/322/head -> origin/gh/ydwu4/322/head 2025-12-04T09:17:11.7037443Z * [new branch] gh/ydwu4/322/orig -> origin/gh/ydwu4/322/orig 2025-12-04T09:17:11.7040107Z * [new branch] gh/ydwu4/327/base -> origin/gh/ydwu4/327/base 2025-12-04T09:17:11.7041983Z * [new branch] gh/ydwu4/327/head -> origin/gh/ydwu4/327/head 2025-12-04T09:17:11.7043811Z * [new branch] gh/ydwu4/327/orig -> origin/gh/ydwu4/327/orig 2025-12-04T09:17:11.7046352Z * [new branch] gh/ydwu4/328/base -> origin/gh/ydwu4/328/base 2025-12-04T09:17:11.7048125Z * [new branch] gh/ydwu4/328/head -> origin/gh/ydwu4/328/head 2025-12-04T09:17:11.7049920Z * [new branch] gh/ydwu4/328/orig -> origin/gh/ydwu4/328/orig 2025-12-04T09:17:11.7052315Z * [new branch] gh/ydwu4/329/base -> origin/gh/ydwu4/329/base 2025-12-04T09:17:11.7054129Z * [new branch] gh/ydwu4/329/head -> origin/gh/ydwu4/329/head 2025-12-04T09:17:11.7055901Z * [new branch] gh/ydwu4/329/orig -> origin/gh/ydwu4/329/orig 2025-12-04T09:17:11.7058558Z * [new branch] gh/ydwu4/330/base -> origin/gh/ydwu4/330/base 2025-12-04T09:17:11.7060283Z * [new branch] gh/ydwu4/330/head -> origin/gh/ydwu4/330/head 2025-12-04T09:17:11.7062226Z * [new branch] gh/ydwu4/330/orig -> origin/gh/ydwu4/330/orig 2025-12-04T09:17:11.7064582Z * [new branch] gh/ydwu4/331/base -> origin/gh/ydwu4/331/base 2025-12-04T09:17:11.7066856Z * [new branch] gh/ydwu4/331/head -> origin/gh/ydwu4/331/head 2025-12-04T09:17:11.7068629Z * [new branch] gh/ydwu4/331/orig -> origin/gh/ydwu4/331/orig 2025-12-04T09:17:11.7070968Z * [new branch] gh/ydwu4/332/base -> origin/gh/ydwu4/332/base 
2025-12-04T09:17:11.7072874Z * [new branch] gh/ydwu4/332/head -> origin/gh/ydwu4/332/head 2025-12-04T09:17:11.7074690Z * [new branch] gh/ydwu4/332/orig -> origin/gh/ydwu4/332/orig 2025-12-04T09:17:11.7077024Z * [new branch] gh/ydwu4/333/base -> origin/gh/ydwu4/333/base 2025-12-04T09:17:11.7078872Z * [new branch] gh/ydwu4/333/head -> origin/gh/ydwu4/333/head 2025-12-04T09:17:11.7080831Z * [new branch] gh/ydwu4/333/orig -> origin/gh/ydwu4/333/orig 2025-12-04T09:17:11.7083156Z * [new branch] gh/ydwu4/334/base -> origin/gh/ydwu4/334/base 2025-12-04T09:17:11.7085055Z * [new branch] gh/ydwu4/334/head -> origin/gh/ydwu4/334/head 2025-12-04T09:17:11.7086949Z * [new branch] gh/ydwu4/334/orig -> origin/gh/ydwu4/334/orig 2025-12-04T09:17:11.7089315Z * [new branch] gh/ydwu4/335/base -> origin/gh/ydwu4/335/base 2025-12-04T09:17:11.7091134Z * [new branch] gh/ydwu4/335/head -> origin/gh/ydwu4/335/head 2025-12-04T09:17:11.7092930Z * [new branch] gh/ydwu4/335/orig -> origin/gh/ydwu4/335/orig 2025-12-04T09:17:11.7096019Z * [new branch] gh/ydwu4/337/base -> origin/gh/ydwu4/337/base 2025-12-04T09:17:11.7097836Z * [new branch] gh/ydwu4/337/head -> origin/gh/ydwu4/337/head 2025-12-04T09:17:11.7099673Z * [new branch] gh/ydwu4/337/orig -> origin/gh/ydwu4/337/orig 2025-12-04T09:17:11.7102549Z * [new branch] gh/ydwu4/339/base -> origin/gh/ydwu4/339/base 2025-12-04T09:17:11.7104393Z * [new branch] gh/ydwu4/339/head -> origin/gh/ydwu4/339/head 2025-12-04T09:17:11.7106189Z * [new branch] gh/ydwu4/339/orig -> origin/gh/ydwu4/339/orig 2025-12-04T09:17:11.7109292Z * [new branch] gh/yf225/133/base -> origin/gh/yf225/133/base 2025-12-04T09:17:11.7111189Z * [new branch] gh/yf225/133/head -> origin/gh/yf225/133/head 2025-12-04T09:17:11.7113788Z * [new branch] gh/yf225/93/base -> origin/gh/yf225/93/base 2025-12-04T09:17:11.7115531Z * [new branch] gh/yf225/93/head -> origin/gh/yf225/93/head 2025-12-04T09:17:11.7119682Z * [new branch] gh/yifuwang/152/base -> origin/gh/yifuwang/152/base 
2025-12-04T09:17:11.7121992Z * [new branch] gh/yifuwang/152/head -> origin/gh/yifuwang/152/head 2025-12-04T09:17:11.7123975Z * [new branch] gh/yifuwang/152/orig -> origin/gh/yifuwang/152/orig 2025-12-04T09:17:11.7126404Z * [new branch] gh/yifuwang/195/base -> origin/gh/yifuwang/195/base 2025-12-04T09:17:11.7128402Z * [new branch] gh/yifuwang/195/head -> origin/gh/yifuwang/195/head 2025-12-04T09:17:11.7130105Z * [new branch] gh/yifuwang/195/orig -> origin/gh/yifuwang/195/orig 2025-12-04T09:17:11.7133220Z * [new branch] gh/yiming0416/1/base -> origin/gh/yiming0416/1/base 2025-12-04T09:17:11.7135160Z * [new branch] gh/yiming0416/1/head -> origin/gh/yiming0416/1/head 2025-12-04T09:17:11.7137550Z * [new branch] gh/yiming0416/2/base -> origin/gh/yiming0416/2/base 2025-12-04T09:17:11.7139321Z * [new branch] gh/yiming0416/2/head -> origin/gh/yiming0416/2/head 2025-12-04T09:17:11.7142471Z * [new branch] gh/yushangdi/1/base -> origin/gh/yushangdi/1/base 2025-12-04T09:17:11.7144474Z * [new branch] gh/yushangdi/1/head -> origin/gh/yushangdi/1/head 2025-12-04T09:17:11.7146890Z * [new branch] gh/yushangdi/10/base -> origin/gh/yushangdi/10/base 2025-12-04T09:17:11.7148682Z * [new branch] gh/yushangdi/10/head -> origin/gh/yushangdi/10/head 2025-12-04T09:17:11.7150536Z * [new branch] gh/yushangdi/10/orig -> origin/gh/yushangdi/10/orig 2025-12-04T09:17:11.7152967Z * [new branch] gh/yushangdi/11/base -> origin/gh/yushangdi/11/base 2025-12-04T09:17:11.7154769Z * [new branch] gh/yushangdi/11/head -> origin/gh/yushangdi/11/head 2025-12-04T09:17:11.7156610Z * [new branch] gh/yushangdi/11/orig -> origin/gh/yushangdi/11/orig 2025-12-04T09:17:11.7159143Z * [new branch] gh/yushangdi/2/base -> origin/gh/yushangdi/2/base 2025-12-04T09:17:11.7161045Z * [new branch] gh/yushangdi/2/head -> origin/gh/yushangdi/2/head 2025-12-04T09:17:11.7163571Z * [new branch] gh/yushangdi/7/base -> origin/gh/yushangdi/7/base 2025-12-04T09:17:11.7165368Z * [new branch] gh/yushangdi/7/head -> 
origin/gh/yushangdi/7/head 2025-12-04T09:17:11.7167595Z * [new branch] gh/yushangdi/7/orig -> origin/gh/yushangdi/7/orig 2025-12-04T09:17:11.7170019Z * [new branch] gh/yushangdi/8/base -> origin/gh/yushangdi/8/base 2025-12-04T09:17:11.7172015Z * [new branch] gh/yushangdi/8/head -> origin/gh/yushangdi/8/head 2025-12-04T09:17:11.7173846Z * [new branch] gh/yushangdi/8/orig -> origin/gh/yushangdi/8/orig 2025-12-04T09:17:11.7176199Z * [new branch] gh/yushangdi/9/base -> origin/gh/yushangdi/9/base 2025-12-04T09:17:11.7178063Z * [new branch] gh/yushangdi/9/head -> origin/gh/yushangdi/9/head 2025-12-04T09:17:11.7179957Z * [new branch] gh/yushangdi/9/orig -> origin/gh/yushangdi/9/orig 2025-12-04T09:17:11.7183138Z * [new branch] gh/zklaus/19/base -> origin/gh/zklaus/19/base 2025-12-04T09:17:11.7185275Z * [new branch] gh/zklaus/19/head -> origin/gh/zklaus/19/head 2025-12-04T09:17:11.7186821Z * [new branch] gh/zklaus/19/orig -> origin/gh/zklaus/19/orig 2025-12-04T09:17:11.7189400Z * [new branch] gh/zklaus/20/base -> origin/gh/zklaus/20/base 2025-12-04T09:17:11.7191243Z * [new branch] gh/zklaus/20/head -> origin/gh/zklaus/20/head 2025-12-04T09:17:11.7193054Z * [new branch] gh/zklaus/20/orig -> origin/gh/zklaus/20/orig 2025-12-04T09:17:11.7196067Z * [new branch] gh/zklaus/21/base -> origin/gh/zklaus/21/base 2025-12-04T09:17:11.7197906Z * [new branch] gh/zklaus/21/head -> origin/gh/zklaus/21/head 2025-12-04T09:17:11.7199842Z * [new branch] gh/zklaus/21/orig -> origin/gh/zklaus/21/orig 2025-12-04T09:17:11.7203684Z * [new branch] gh/zklaus/22/base -> origin/gh/zklaus/22/base 2025-12-04T09:17:11.7205278Z * [new branch] gh/zklaus/22/head -> origin/gh/zklaus/22/head 2025-12-04T09:17:11.7207076Z * [new branch] gh/zklaus/22/orig -> origin/gh/zklaus/22/orig 2025-12-04T09:17:11.7209785Z * [new branch] gh/zklaus/23/base -> origin/gh/zklaus/23/base 2025-12-04T09:17:11.7211612Z * [new branch] gh/zklaus/23/head -> origin/gh/zklaus/23/head 2025-12-04T09:17:11.7213473Z * [new branch] 
gh/zklaus/23/orig -> origin/gh/zklaus/23/orig 2025-12-04T09:17:11.7215887Z * [new branch] gh/zklaus/24/base -> origin/gh/zklaus/24/base 2025-12-04T09:17:11.7217684Z * [new branch] gh/zklaus/24/head -> origin/gh/zklaus/24/head 2025-12-04T09:17:11.7219539Z * [new branch] gh/zklaus/24/orig -> origin/gh/zklaus/24/orig 2025-12-04T09:17:11.7222878Z * [new branch] gh/zou3519/1197/base -> origin/gh/zou3519/1197/base 2025-12-04T09:17:11.7224509Z * [new branch] gh/zou3519/1197/head -> origin/gh/zou3519/1197/head 2025-12-04T09:17:11.7226265Z * [new branch] gh/zou3519/1197/orig -> origin/gh/zou3519/1197/orig 2025-12-04T09:17:11.7229181Z * [new branch] gh/zou3519/1199/base -> origin/gh/zou3519/1199/base 2025-12-04T09:17:11.7231063Z * [new branch] gh/zou3519/1199/head -> origin/gh/zou3519/1199/head 2025-12-04T09:17:11.7232969Z * [new branch] gh/zou3519/1199/orig -> origin/gh/zou3519/1199/orig 2025-12-04T09:17:11.7235423Z * [new branch] gh/zou3519/1200/base -> origin/gh/zou3519/1200/base 2025-12-04T09:17:11.7237317Z * [new branch] gh/zou3519/1200/head -> origin/gh/zou3519/1200/head 2025-12-04T09:17:11.7239241Z * [new branch] gh/zou3519/1200/orig -> origin/gh/zou3519/1200/orig 2025-12-04T09:17:11.7241900Z * [new branch] gh/zou3519/1201/base -> origin/gh/zou3519/1201/base 2025-12-04T09:17:11.7243588Z * [new branch] gh/zou3519/1201/head -> origin/gh/zou3519/1201/head 2025-12-04T09:17:11.7245354Z * [new branch] gh/zou3519/1201/orig -> origin/gh/zou3519/1201/orig 2025-12-04T09:17:11.7247689Z * [new branch] gh/zou3519/1202/base -> origin/gh/zou3519/1202/base 2025-12-04T09:17:11.7249631Z * [new branch] gh/zou3519/1202/head -> origin/gh/zou3519/1202/head 2025-12-04T09:17:11.7251482Z * [new branch] gh/zou3519/1202/orig -> origin/gh/zou3519/1202/orig 2025-12-04T09:17:11.7254488Z * [new branch] gh/zpcore/1/base -> origin/gh/zpcore/1/base 2025-12-04T09:17:11.7256990Z * [new branch] gh/zpcore/1/head -> origin/gh/zpcore/1/head 2025-12-04T09:17:11.7259655Z * [new branch] gh/zpcore/11/base -> 
origin/gh/zpcore/11/base 2025-12-04T09:17:11.7261504Z * [new branch] gh/zpcore/11/head -> origin/gh/zpcore/11/head 2025-12-04T09:17:11.7263378Z * [new branch] gh/zpcore/11/orig -> origin/gh/zpcore/11/orig 2025-12-04T09:17:11.7266277Z * [new branch] gh/zpcore/12/base -> origin/gh/zpcore/12/base 2025-12-04T09:17:11.7268147Z * [new branch] gh/zpcore/12/head -> origin/gh/zpcore/12/head 2025-12-04T09:17:11.7269999Z * [new branch] gh/zpcore/12/orig -> origin/gh/zpcore/12/orig 2025-12-04T09:17:11.7272615Z * [new branch] gh/zpcore/13/base -> origin/gh/zpcore/13/base 2025-12-04T09:17:11.7274335Z * [new branch] gh/zpcore/13/head -> origin/gh/zpcore/13/head 2025-12-04T09:17:11.7276128Z * [new branch] gh/zpcore/13/orig -> origin/gh/zpcore/13/orig 2025-12-04T09:17:11.7278729Z * [new branch] gh/zpcore/14/base -> origin/gh/zpcore/14/base 2025-12-04T09:17:11.7280687Z * [new branch] gh/zpcore/14/head -> origin/gh/zpcore/14/head 2025-12-04T09:17:11.7282635Z * [new branch] gh/zpcore/14/orig -> origin/gh/zpcore/14/orig 2025-12-04T09:17:11.7285330Z * [new branch] gh/zpcore/15/base -> origin/gh/zpcore/15/base 2025-12-04T09:17:11.7287152Z * [new branch] gh/zpcore/15/head -> origin/gh/zpcore/15/head 2025-12-04T09:17:11.7289037Z * [new branch] gh/zpcore/15/orig -> origin/gh/zpcore/15/orig 2025-12-04T09:17:11.7291516Z * [new branch] gh/zpcore/2/base -> origin/gh/zpcore/2/base 2025-12-04T09:17:11.7293387Z * [new branch] gh/zpcore/2/head -> origin/gh/zpcore/2/head 2025-12-04T09:17:11.7296433Z * [new branch] gh/zpcore/21/base -> origin/gh/zpcore/21/base 2025-12-04T09:17:11.7298472Z * [new branch] gh/zpcore/21/head -> origin/gh/zpcore/21/head 2025-12-04T09:17:11.7300157Z * [new branch] gh/zpcore/21/orig -> origin/gh/zpcore/21/orig 2025-12-04T09:17:11.7303145Z * [new branch] gh/zpcore/22/base -> origin/gh/zpcore/22/base 2025-12-04T09:17:11.7304866Z * [new branch] gh/zpcore/22/head -> origin/gh/zpcore/22/head 2025-12-04T09:17:11.7306776Z * [new branch] gh/zpcore/22/orig -> 
origin/gh/zpcore/22/orig 2025-12-04T09:17:11.7309582Z * [new branch] gh/zpcore/23/base -> origin/gh/zpcore/23/base 2025-12-04T09:17:11.7311475Z * [new branch] gh/zpcore/23/head -> origin/gh/zpcore/23/head 2025-12-04T09:17:11.7313385Z * [new branch] gh/zpcore/23/orig -> origin/gh/zpcore/23/orig 2025-12-04T09:17:11.7315708Z * [new branch] gh/zpcore/24/base -> origin/gh/zpcore/24/base 2025-12-04T09:17:11.7317584Z * [new branch] gh/zpcore/24/head -> origin/gh/zpcore/24/head 2025-12-04T09:17:11.7319542Z * [new branch] gh/zpcore/24/orig -> origin/gh/zpcore/24/orig 2025-12-04T09:17:11.7322434Z * [new branch] gh/zpcore/25/base -> origin/gh/zpcore/25/base 2025-12-04T09:17:11.7324155Z * [new branch] gh/zpcore/25/head -> origin/gh/zpcore/25/head 2025-12-04T09:17:11.7325966Z * [new branch] gh/zpcore/25/orig -> origin/gh/zpcore/25/orig 2025-12-04T09:17:11.7328555Z * [new branch] gh/zpcore/26/base -> origin/gh/zpcore/26/base 2025-12-04T09:17:11.7330427Z * [new branch] gh/zpcore/26/head -> origin/gh/zpcore/26/head 2025-12-04T09:17:11.7332265Z * [new branch] gh/zpcore/26/orig -> origin/gh/zpcore/26/orig 2025-12-04T09:17:11.7334994Z * [new branch] gh/zpcore/27/base -> origin/gh/zpcore/27/base 2025-12-04T09:17:11.7336842Z * [new branch] gh/zpcore/27/head -> origin/gh/zpcore/27/head 2025-12-04T09:17:11.7338715Z * [new branch] gh/zpcore/27/orig -> origin/gh/zpcore/27/orig 2025-12-04T09:17:11.7341692Z * [new branch] gh/zpcore/28/base -> origin/gh/zpcore/28/base 2025-12-04T09:17:11.7343945Z * [new branch] gh/zpcore/28/head -> origin/gh/zpcore/28/head 2025-12-04T09:17:11.7345827Z * [new branch] gh/zpcore/28/orig -> origin/gh/zpcore/28/orig 2025-12-04T09:17:11.7348141Z * [new branch] gh/zpcore/3/base -> origin/gh/zpcore/3/base 2025-12-04T09:17:11.7349935Z * [new branch] gh/zpcore/3/head -> origin/gh/zpcore/3/head 2025-12-04T09:17:11.7352264Z * [new branch] gh/zpcore/4/base -> origin/gh/zpcore/4/base 2025-12-04T09:17:11.7354547Z * [new branch] gh/zpcore/4/head -> origin/gh/zpcore/4/head 
2025-12-04T09:17:11.7357502Z * [new branch] gh/zpcore/5/base -> origin/gh/zpcore/5/base 2025-12-04T09:17:11.7359404Z * [new branch] gh/zpcore/5/head -> origin/gh/zpcore/5/head 2025-12-04T09:17:11.7361961Z * [new branch] gh/zpcore/6/base -> origin/gh/zpcore/6/base 2025-12-04T09:17:11.7363710Z * [new branch] gh/zpcore/6/head -> origin/gh/zpcore/6/head 2025-12-04T09:17:11.7366624Z * [new branch] gh/zpcore/7/base -> origin/gh/zpcore/7/base 2025-12-04T09:17:11.7368480Z * [new branch] gh/zpcore/7/head -> origin/gh/zpcore/7/head 2025-12-04T09:17:11.7370896Z * [new branch] gh/zpcore/8/base -> origin/gh/zpcore/8/base 2025-12-04T09:17:11.7372710Z * [new branch] gh/zpcore/8/head -> origin/gh/zpcore/8/head 2025-12-04T09:17:11.7374652Z * [new branch] google-main -> origin/google-main 2025-12-04T09:17:11.7377282Z * [new branch] guangyey/external_stream -> origin/guangyey/external_stream 2025-12-04T09:17:11.7379093Z * [new branch] guangyey/test_2025 -> origin/guangyey/test_2025 2025-12-04T09:17:11.7381837Z * [new branch] guilhermeleobas/cherry-pick-55d87d9dfd9 -> origin/guilhermeleobas/cherry-pick-55d87d9dfd9 2025-12-04T09:17:11.7384331Z * [new branch] hameerabbasi/complex_tensor_subclass -> origin/hameerabbasi/complex_tensor_subclass 2025-12-04T09:17:11.7386226Z * [new branch] hameerabbasi/fix-ctensor-gradcheck-tests -> origin/hameerabbasi/fix-ctensor-gradcheck-tests 2025-12-04T09:17:11.7387876Z * [new branch] hameerabbasi/gradcheck-allclose -> origin/hameerabbasi/gradcheck-allclose 2025-12-04T09:17:11.7389662Z * [new branch] hc_baseline -> origin/hc_baseline 2025-12-04T09:17:11.7391588Z * [new branch] hhh_rand -> origin/hhh_rand 2025-12-04T09:17:11.7394012Z * [new branch] huba/f1 -> origin/huba/f1 2025-12-04T09:17:11.7396516Z * [new branch] increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test -> origin/increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test 2025-12-04T09:17:11.7398625Z * [new branch] inlining -> origin/inlining 2025-12-04T09:17:11.7401007Z * [new branch] 
inlining-ezyang -> origin/inlining-ezyang 2025-12-04T09:17:11.7403282Z * [new branch] install-torchao-0.13.0 -> origin/install-torchao-0.13.0 2025-12-04T09:17:11.7405396Z * [new branch] instrument-trunk-pull-linux-with-job-test-filters -> origin/instrument-trunk-pull-linux-with-job-test-filters 2025-12-04T09:17:11.7407059Z * [new branch] invoke-subgraph -> origin/invoke-subgraph 2025-12-04T09:17:11.7409109Z * [new branch] issue#58739 -> origin/issue#58739 2025-12-04T09:17:11.7411064Z * [new branch] jainapurva-patch-1 -> origin/jainapurva-patch-1 2025-12-04T09:17:11.7413545Z * [new branch] jathu/o3 -> origin/jathu/o3 2025-12-04T09:17:11.7415235Z * [new branch] jathu/sve -> origin/jathu/sve 2025-12-04T09:17:11.7418013Z * [new branch] jcaip/test-cusparselt-version-0.6.2 -> origin/jcaip/test-cusparselt-version-0.6.2 2025-12-04T09:17:11.7419702Z * [new branch] jcaip/update-cusparselt-0.6.2 -> origin/jcaip/update-cusparselt-0.6.2 2025-12-04T09:17:11.7422242Z * [new branch] jiannanWang/memorysnapshot_filter -> origin/jiannanWang/memorysnapshot_filter 2025-12-04T09:17:11.7424071Z * [new branch] jiannanWang/profilerstepwarning -> origin/jiannanWang/profilerstepwarning 2025-12-04T09:17:11.7426000Z * [new branch] jithunnair-amd-patch-1 -> origin/jithunnair-amd-patch-1 2025-12-04T09:17:11.7427937Z * [new branch] jithunnair-amd-patch-10 -> origin/jithunnair-amd-patch-10 2025-12-04T09:17:11.7429994Z * [new branch] jithunnair-amd-patch-2 -> origin/jithunnair-amd-patch-2 2025-12-04T09:17:11.7431925Z * [new branch] jithunnair-amd-patch-3 -> origin/jithunnair-amd-patch-3 2025-12-04T09:17:11.7433918Z * [new branch] jithunnair-amd-patch-4 -> origin/jithunnair-amd-patch-4 2025-12-04T09:17:11.7435842Z * [new branch] jithunnair-amd-patch-5 -> origin/jithunnair-amd-patch-5 2025-12-04T09:17:11.7437782Z * [new branch] jithunnair-amd-patch-6 -> origin/jithunnair-amd-patch-6 2025-12-04T09:17:11.7439907Z * [new branch] jithunnair-amd-patch-7 -> origin/jithunnair-amd-patch-7 
2025-12-04T09:17:11.7441945Z * [new branch] jithunnair-amd-patch-8 -> origin/jithunnair-amd-patch-8 2025-12-04T09:17:11.7443844Z * [new branch] jithunnair-amd-patch-9 -> origin/jithunnair-amd-patch-9 2025-12-04T09:17:11.7446472Z * [new branch] justinchu/native-qdq -> origin/justinchu/native-qdq 2025-12-04T09:17:11.7449181Z * [new branch] kainan666/xlf_debug -> origin/kainan666/xlf_debug 2025-12-04T09:17:11.7450938Z * [new branch] kainan_test -> origin/kainan_test 2025-12-04T09:17:11.7452801Z * [new branch] larryliu0820-patch-1 -> origin/larryliu0820-patch-1 2025-12-04T09:17:11.7455428Z * [new branch] leslie/test_group_gemm_epilogues -> origin/leslie/test_group_gemm_epilogues 2025-12-04T09:17:11.7457929Z * [new branch] lessw2020/fix_cutlass_cache_error -> origin/lessw2020/fix_cutlass_cache_error 2025-12-04T09:17:11.7460461Z * [new branch] liaoxuan/shm_all_reduce -> origin/liaoxuan/shm_all_reduce 2025-12-04T09:17:11.7462260Z * [new branch] liaoxuan/test_fa_disable_softmax -> origin/liaoxuan/test_fa_disable_softmax 2025-12-04T09:17:11.7464133Z * [new branch] liaoxuan/test_int8_sdpa -> origin/liaoxuan/test_int8_sdpa 2025-12-04T09:17:11.7465982Z * [new branch] llama4-stable -> origin/llama4-stable 2025-12-04T09:17:11.7469111Z * [new branch] lts/release/1.8 -> origin/lts/release/1.8 2025-12-04T09:17:11.7471697Z * [new branch] lucaskabela/#94773 -> origin/lucaskabela/#94773 2025-12-04T09:17:11.7473423Z * [new branch] lucaskabela/fix_164876 -> origin/lucaskabela/fix_164876 2025-12-04T09:17:11.7475147Z * [new branch] lucaskabela/flop_counter -> origin/lucaskabela/flop_counter 2025-12-04T09:17:11.7476888Z * [new branch] lucaskabela/func_under_decomp -> origin/lucaskabela/func_under_decomp 2025-12-04T09:17:11.7478721Z * [new branch] lucaskabela/functional_in_dynamo -> origin/lucaskabela/functional_in_dynamo 2025-12-04T09:17:11.7480719Z * [new branch] lucaskabela/install_params_as_graph_attr -> origin/lucaskabela/install_params_as_graph_attr 2025-12-04T09:17:11.7482793Z * [new 
branch] lucaskabela/parameters_as_graph_attr -> origin/lucaskabela/parameters_as_graph_attr 2025-12-04T09:17:11.7485078Z * [new branch] lucaskabela/remove_aot_dispatcher_metadata -> origin/lucaskabela/remove_aot_dispatcher_metadata 2025-12-04T09:17:11.7486790Z * [new branch] lucaskabela/rnn_decomp -> origin/lucaskabela/rnn_decomp 2025-12-04T09:17:11.7488752Z * [new branch] lucaskabela/typing_backends -> origin/lucaskabela/typing_backends 2025-12-04T09:17:11.7490754Z * [new branch] lucaskabela/typing_ctx_manager -> origin/lucaskabela/typing_ctx_manager 2025-12-04T09:17:11.7492599Z * [new branch] lucaskabela/typing_nn_module -> origin/lucaskabela/typing_nn_module 2025-12-04T09:17:11.7494407Z * [new branch] lucaskabela/typing_user_defined -> origin/lucaskabela/typing_user_defined 2025-12-04T09:17:11.7496229Z * [new branch] lucaskabela/typing_variables -> origin/lucaskabela/typing_variables 2025-12-04T09:17:11.7498072Z * [new branch] lucaskabela/typing_variables_dicts -> origin/lucaskabela/typing_variables_dicts 2025-12-04T09:17:11.7500045Z * [new branch] lucaskabela/typing_variables_functions -> origin/lucaskabela/typing_variables_functions 2025-12-04T09:17:11.7502096Z * [new branch] lucaskabela/typing_variables_lists -> origin/lucaskabela/typing_variables_lists 2025-12-04T09:17:11.7504569Z * [new branch] lw/torch_box_by_ref -> origin/lw/torch_box_by_ref 2025-12-04T09:17:11.7506501Z * [new branch] main -> origin/main 2025-12-04T09:17:11.7508663Z * [new branch] malfet-patch-1 -> origin/malfet-patch-1 2025-12-04T09:17:11.7510677Z * [new branch] malfet-patch-2 -> origin/malfet-patch-2 2025-12-04T09:17:11.7512656Z * [new branch] malfet-patch-3 -> origin/malfet-patch-3 2025-12-04T09:17:11.7514805Z * [new branch] malfet-patch-4 -> origin/malfet-patch-4 2025-12-04T09:17:11.7516652Z * [new branch] malfet-patch-5 -> origin/malfet-patch-5 2025-12-04T09:17:11.7518635Z * [new branch] malfet-patch-6 -> origin/malfet-patch-6 2025-12-04T09:17:11.7520872Z * [new branch] 
malfet-patch-7 -> origin/malfet-patch-7 2025-12-04T09:17:11.7522906Z * [new branch] malfet-patch-8 -> origin/malfet-patch-8 2025-12-04T09:17:11.7525907Z * [new branch] malfet/add-3.14-ci -> origin/malfet/add-3.14-ci 2025-12-04T09:17:11.7527952Z * [new branch] malfet/be-do-not-make-typos-in-build-artifacts -> origin/malfet/be-do-not-make-typos-in-build-artifacts 2025-12-04T09:17:11.7529788Z * [new branch] malfet/be-move-more-settings-to-checkout-pytorch -> origin/malfet/be-move-more-settings-to-checkout-pytorch 2025-12-04T09:17:11.7531779Z * [new branch] malfet/be-remove-misisng-neon-headers -> origin/malfet/be-remove-misisng-neon-headers 2025-12-04T09:17:11.7533767Z * [new branch] malfet/mps-implement-col2im -> origin/malfet/mps-implement-col2im 2025-12-04T09:17:11.7536427Z * [new branch] manuel/aoti_metal_shimify-thread_safe -> origin/manuel/aoti_metal_shimify-thread_safe 2025-12-04T09:17:11.7538103Z * [new branch] manuel/inductor_link_openmp -> origin/manuel/inductor_link_openmp 2025-12-04T09:17:11.7540509Z * [new branch] masnesral/metaconda -> origin/masnesral/metaconda 2025-12-04T09:17:11.7542593Z * [new branch] mem_profiler_flaky_fix -> origin/mem_profiler_flaky_fix 2025-12-04T09:17:11.7544463Z * [new branch] mem_profiler_stack_trace -> origin/mem_profiler_stack_trace 2025-12-04T09:17:11.7546443Z * [new branch] memory_profiler_stack -> origin/memory_profiler_stack 2025-12-04T09:17:11.7549056Z * [new branch] metascroy-patch-1 -> origin/metascroy-patch-1 2025-12-04T09:17:11.7551047Z * [new branch] mingw_posix -> origin/mingw_posix 2025-12-04T09:17:11.7553635Z * [new branch] mlazos/S429861-debug -> origin/mlazos/S429861-debug 2025-12-04T09:17:11.7555271Z * [new branch] mlazos/aa -> origin/mlazos/aa 2025-12-04T09:17:11.7557009Z * [new branch] mlazos/acts -> origin/mlazos/acts 2025-12-04T09:17:11.7558848Z * [new branch] mlazos/arg-renames -> origin/mlazos/arg-renames 2025-12-04T09:17:11.7560697Z * [new branch] mlazos/bad-cudagraphs -> origin/mlazos/bad-cudagraphs 
2025-12-04T09:17:11.7562481Z * [new branch] mlazos/baseline-graph-breaks -> origin/mlazos/baseline-graph-breaks 2025-12-04T09:17:11.7564224Z * [new branch] mlazos/beta-tensor -> origin/mlazos/beta-tensor 2025-12-04T09:17:11.7566512Z * [new branch] mlazos/buffers -> origin/mlazos/buffers 2025-12-04T09:17:11.7568142Z * [new branch] mlazos/buffers2 -> origin/mlazos/buffers2 2025-12-04T09:17:11.7570212Z * [new branch] mlazos/buffers3 -> origin/mlazos/buffers3 2025-12-04T09:17:11.7572306Z * [new branch] mlazos/bwd -> origin/mlazos/bwd 2025-12-04T09:17:11.7574101Z * [new branch] mlazos/combo-test -> origin/mlazos/combo-test 2025-12-04T09:17:11.7575994Z * [new branch] mlazos/ctx-cleanup -> origin/mlazos/ctx-cleanup 2025-12-04T09:17:11.7577979Z * [new branch] mlazos/cuda-cmd-log -> origin/mlazos/cuda-cmd-log 2025-12-04T09:17:11.7580050Z * [new branch] mlazos/cudagraph-tests -> origin/mlazos/cudagraph-tests 2025-12-04T09:17:11.7581942Z * [new branch] mlazos/cudagraphs-measurement -> origin/mlazos/cudagraphs-measurement 2025-12-04T09:17:11.7583843Z * [new branch] mlazos/cutlass-test -> origin/mlazos/cutlass-test 2025-12-04T09:17:11.7585827Z * [new branch] mlazos/cutlass-topo-bug -> origin/mlazos/cutlass-topo-bug 2025-12-04T09:17:11.7587618Z * [new branch] mlazos/dataclass-proxy -> origin/mlazos/dataclass-proxy 2025-12-04T09:17:11.7589417Z * [new branch] mlazos/dc-attrs -> origin/mlazos/dc-attrs 2025-12-04T09:17:11.7591310Z * [new branch] mlazos/dc-helion -> origin/mlazos/dc-helion 2025-12-04T09:17:11.7593133Z * [new branch] mlazos/dict-fix -> origin/mlazos/dict-fix 2025-12-04T09:17:11.7594951Z * [new branch] mlazos/disable-tf -> origin/mlazos/disable-tf 2025-12-04T09:17:11.7596779Z * [new branch] mlazos/dupe-fix -> origin/mlazos/dupe-fix 2025-12-04T09:17:11.7598729Z * [new branch] mlazos/dyn-batch -> origin/mlazos/dyn-batch 2025-12-04T09:17:11.7600714Z * [new branch] mlazos/evt -> origin/mlazos/evt 2025-12-04T09:17:11.7602963Z * [new branch] mlazos/extract-examples -> 
origin/mlazos/extract-examples 2025-12-04T09:17:11.7604691Z * [new branch] mlazos/foreach-op -> origin/mlazos/foreach-op 2025-12-04T09:17:11.7606489Z * [new branch] mlazos/fp8 -> origin/mlazos/fp8 2025-12-04T09:17:11.7608615Z * [new branch] mlazos/fp8-bias -> origin/mlazos/fp8-bias 2025-12-04T09:17:11.7610465Z * [new branch] mlazos/fp8-bias-fusion -> origin/mlazos/fp8-bias-fusion 2025-12-04T09:17:11.7612280Z * [new branch] mlazos/fp8-fixes -> origin/mlazos/fp8-fixes 2025-12-04T09:17:11.7614110Z * [new branch] mlazos/freezing -> origin/mlazos/freezing 2025-12-04T09:17:11.7616031Z * [new branch] mlazos/h-comp -> origin/mlazos/h-comp 2025-12-04T09:17:11.7617920Z * [new branch] mlazos/h-comp2 -> origin/mlazos/h-comp2 2025-12-04T09:17:11.7619911Z * [new branch] mlazos/hash-hop -> origin/mlazos/hash-hop 2025-12-04T09:17:11.7621748Z * [new branch] mlazos/hc -> origin/mlazos/hc 2025-12-04T09:17:11.7624150Z * [new branch] mlazos/hc-cycles -> origin/mlazos/hc-cycles 2025-12-04T09:17:11.7626003Z * [new branch] mlazos/hc-fixes -> origin/mlazos/hc-fixes 2025-12-04T09:17:11.7627860Z * [new branch] mlazos/hc-fixes3 -> origin/mlazos/hc-fixes3 2025-12-04T09:17:11.7629753Z * [new branch] mlazos/hc-fixes4 -> origin/mlazos/hc-fixes4 2025-12-04T09:17:11.7631655Z * [new branch] mlazos/hc-hf -> origin/mlazos/hc-hf 2025-12-04T09:17:11.7633403Z * [new branch] mlazos/hc-mut -> origin/mlazos/hc-mut 2025-12-04T09:17:11.7635270Z * [new branch] mlazos/hc10 -> origin/mlazos/hc10 2025-12-04T09:17:11.7637100Z * [new branch] mlazos/hc11 -> origin/mlazos/hc11 2025-12-04T09:17:11.7639094Z * [new branch] mlazos/hc12 -> origin/mlazos/hc12 2025-12-04T09:17:11.7641078Z * [new branch] mlazos/hc13 -> origin/mlazos/hc13 2025-12-04T09:17:11.7642959Z * [new branch] mlazos/hc14 -> origin/mlazos/hc14 2025-12-04T09:17:11.7644746Z * [new branch] mlazos/hc15 -> origin/mlazos/hc15 2025-12-04T09:17:11.7646578Z * [new branch] mlazos/hc2 -> origin/mlazos/hc2 2025-12-04T09:17:11.7648544Z * [new branch] mlazos/hc4 -> 
origin/mlazos/hc4 2025-12-04T09:17:11.7650337Z * [new branch] mlazos/hc5 -> origin/mlazos/hc5 2025-12-04T09:17:11.7652131Z * [new branch] mlazos/hc6 -> origin/mlazos/hc6 2025-12-04T09:17:11.7654085Z * [new branch] mlazos/hc7 -> origin/mlazos/hc7 2025-12-04T09:17:11.7655921Z * [new branch] mlazos/hc8 -> origin/mlazos/hc8 2025-12-04T09:17:11.7657693Z * [new branch] mlazos/hc9 -> origin/mlazos/hc9 2025-12-04T09:17:11.7659531Z * [new branch] mlazos/hc_baseline2 -> origin/mlazos/hc_baseline2 2025-12-04T09:17:11.7661341Z * [new branch] mlazos/inductor-streams -> origin/mlazos/inductor-streams 2025-12-04T09:17:11.7663032Z * [new branch] mlazos/main -> origin/mlazos/main 2025-12-04T09:17:11.7664986Z * [new branch] mlazos/mcg2 -> origin/mlazos/mcg2 2025-12-04T09:17:11.7666824Z * [new branch] mlazos/meta-guards -> origin/mlazos/meta-guards 2025-12-04T09:17:11.7669542Z * [new branch] mlazos/mlazos/foreach-map-adam -> origin/mlazos/mlazos/foreach-map-adam 2025-12-04T09:17:11.7671391Z * [new branch] mlazos/mlazos/tf-mode-backup -> origin/mlazos/mlazos/tf-mode-backup 2025-12-04T09:17:11.7673166Z * [new branch] mlazos/mod-fix -> origin/mlazos/mod-fix 2025-12-04T09:17:11.7675064Z * [new branch] mlazos/mode-fix -> origin/mlazos/mode-fix 2025-12-04T09:17:11.7676908Z * [new branch] mlazos/offsets -> origin/mlazos/offsets 2025-12-04T09:17:11.7678745Z * [new branch] mlazos/overguarding -> origin/mlazos/overguarding 2025-12-04T09:17:11.7680740Z * [new branch] mlazos/proxy-ctors -> origin/mlazos/proxy-ctors 2025-12-04T09:17:11.7682579Z * [new branch] mlazos/quant-fix -> origin/mlazos/quant-fix 2025-12-04T09:17:11.7684474Z * [new branch] mlazos/resnet-fix -> origin/mlazos/resnet-fix 2025-12-04T09:17:11.7686371Z * [new branch] mlazos/rm-buf-names -> origin/mlazos/rm-buf-names 2025-12-04T09:17:11.7688205Z * [new branch] mlazos/rm-code -> origin/mlazos/rm-code 2025-12-04T09:17:11.7690048Z * [new branch] mlazos/rm-spam -> origin/mlazos/rm-spam 2025-12-04T09:17:11.7692367Z * [new branch] 
mlazos/rtp -> origin/mlazos/rtp 2025-12-04T09:17:11.7694288Z * [new branch] mlazos/static-idx-dbg -> origin/mlazos/static-idx-dbg 2025-12-04T09:17:11.7696171Z * [new branch] mlazos/static-inputs-log -> origin/mlazos/static-inputs-log 2025-12-04T09:17:11.7698009Z * [new branch] mlazos/stests -> origin/mlazos/stests 2025-12-04T09:17:11.7699967Z * [new branch] mlazos/stream-ops -> origin/mlazos/stream-ops 2025-12-04T09:17:11.7703689Z * [new branch] mlazos/td-fix2 -> origin/mlazos/td-fix2 2025-12-04T09:17:11.7705606Z * [new branch] mlazos/tensor-hasattr2 -> origin/mlazos/tensor-hasattr2 2025-12-04T09:17:11.7707604Z * [new branch] mlazos/test -> origin/mlazos/test 2025-12-04T09:17:11.7709550Z * [new branch] mlazos/tf-mode -> origin/mlazos/tf-mode 2025-12-04T09:17:11.7711467Z * [new branch] mlazos/tf-mode-backup2 -> origin/mlazos/tf-mode-backup2 2025-12-04T09:17:11.7713350Z * [new branch] mlazos/tf-mode-reland -> origin/mlazos/tf-mode-reland 2025-12-04T09:17:11.7715252Z * [new branch] mlazos/tf-mode-reland2 -> origin/mlazos/tf-mode-reland2 2025-12-04T09:17:11.7717117Z * [new branch] mlazos/tf-mode-reland3 -> origin/mlazos/tf-mode-reland3 2025-12-04T09:17:11.7718920Z * [new branch] mlazos/triton-no-epi -> origin/mlazos/triton-no-epi 2025-12-04T09:17:11.7721049Z * [new branch] mlazos/tune-proto -> origin/mlazos/tune-proto 2025-12-04T09:17:11.7722935Z * [new branch] mlazos/tuple-fixes -> origin/mlazos/tuple-fixes 2025-12-04T09:17:11.7724981Z * [new branch] mlazos/tuple-fixes2 -> origin/mlazos/tuple-fixes2 2025-12-04T09:17:11.7726759Z * [new branch] mlazos/tuple-handling -> origin/mlazos/tuple-handling 2025-12-04T09:17:11.7728756Z * [new branch] mlazos/user-stream-base -> origin/mlazos/user-stream-base 2025-12-04T09:17:11.7730703Z * [new branch] mlazos/user-streams -> origin/mlazos/user-streams 2025-12-04T09:17:11.7732556Z * [new branch] mlazos/user-streams-backup -> origin/mlazos/user-streams-backup 2025-12-04T09:17:11.7734387Z * [new branch] mlazos/user-streams-backup2 -> 
origin/mlazos/user-streams-backup2 2025-12-04T09:17:11.7736280Z * [new branch] mlazos/vary-beta -> origin/mlazos/vary-beta 2025-12-04T09:17:11.7738148Z * [new branch] mlazos/vary-beta2 -> origin/mlazos/vary-beta2 2025-12-04T09:17:11.7740054Z * [new branch] mlazos/weird-perf1 -> origin/mlazos/weird-perf1 2025-12-04T09:17:11.7742048Z * [new branch] mm_out_dtype_compile -> origin/mm_out_dtype_compile 2025-12-04T09:17:11.7743934Z * [new branch] module-shim -> origin/module-shim 2025-12-04T09:17:11.7745972Z * [new branch] move_config -> origin/move_config 2025-12-04T09:17:11.7749002Z * [new branch] msaroufim/reduce -> origin/msaroufim/reduce 2025-12-04T09:17:11.7751484Z * [new branch] mtia/basic-cmake -> origin/mtia/basic-cmake 2025-12-04T09:17:11.7753986Z * [new branch] mwizak/fix-triton-block-shape -> origin/mwizak/fix-triton-block-shape 2025-12-04T09:17:11.7755843Z * [new branch] my_varlen_backup -> origin/my_varlen_backup 2025-12-04T09:17:11.7757921Z * [new branch] nativert_num_outputs -> origin/nativert_num_outputs 2025-12-04T09:17:11.7760038Z * [new branch] new-codegen -> origin/new-codegen 2025-12-04T09:17:11.7762012Z * [new branch] newtest-base -> origin/newtest-base 2025-12-04T09:17:11.7764520Z * [new branch] ngimel/addmm_dtype -> origin/ngimel/addmm_dtype 2025-12-04T09:17:11.7766207Z * [new branch] ngimel/div_inv -> origin/ngimel/div_inv 2025-12-04T09:17:11.7767991Z * [new branch] ngimel/error_index_list -> origin/ngimel/error_index_list 2025-12-04T09:17:11.7769786Z * [new branch] ngimel/gather_grid -> origin/ngimel/gather_grid 2025-12-04T09:17:11.7771555Z * [new branch] ngimel/gather_grid_release -> origin/ngimel/gather_grid_release 2025-12-04T09:17:11.7773199Z * [new branch] ngimel/gg_new -> origin/ngimel/gg_new 2025-12-04T09:17:11.7774947Z * [new branch] ngimel/hostalloc -> origin/ngimel/hostalloc 2025-12-04T09:17:11.7776626Z * [new branch] ngimel/storage_id -> origin/ngimel/storage_id 2025-12-04T09:17:11.7778558Z * [new branch] nightly -> origin/nightly 
2025-12-04T09:17:11.7781218Z * [new branch] nikitaved/addmm_1_rowcol_lt_path_check -> origin/nikitaved/addmm_1_rowcol_lt_path_check 2025-12-04T09:17:11.7783104Z * [new branch] nikitaved/addmm_epilogue_fusions_2d_bias -> origin/nikitaved/addmm_epilogue_fusions_2d_bias 2025-12-04T09:17:11.7785006Z * [new branch] nikitaved/addmm_epilogue_fusions_inductor -> origin/nikitaved/addmm_epilogue_fusions_inductor 2025-12-04T09:17:11.7787101Z * [new branch] nikitaved/addmm_epilogue_fusions_scratch -> origin/nikitaved/addmm_epilogue_fusions_scratch 2025-12-04T09:17:11.7789230Z * [new branch] nikitaved/grad_addmm_epilogue_fusions -> origin/nikitaved/grad_addmm_epilogue_fusions 2025-12-04T09:17:11.7791509Z * [new branch] nikitaved/simpler_can_use_32bit_index -> origin/nikitaved/simpler_can_use_32bit_index 2025-12-04T09:17:11.7793359Z * [new branch] nikitaved/test -> origin/nikitaved/test 2025-12-04T09:17:11.7795598Z * [new branch] nmacchioni-perf-test-async-autotune -> origin/nmacchioni-perf-test-async-autotune 2025-12-04T09:17:11.7797455Z * [new branch] no_distributed_log_spew -> origin/no_distributed_log_spew 2025-12-04T09:17:11.7799513Z * [new branch] nofun-hack -> origin/nofun-hack 2025-12-04T09:17:11.7801805Z * [new branch] norm_bench -> origin/norm_bench 2025-12-04T09:17:11.7804307Z * [new branch] nullplay/fuse_matmul -> origin/nullplay/fuse_matmul 2025-12-04T09:17:11.7806267Z * [new branch] nullplay_fuse_matmul -> origin/nullplay_fuse_matmul 2025-12-04T09:17:11.7808198Z * [new branch] optimizer_test -> origin/optimizer_test 2025-12-04T09:17:11.7811397Z * [new branch] orig/release/1.10 -> origin/orig/release/1.10 2025-12-04T09:17:11.7813438Z * [new branch] orig/release/1.11 -> origin/orig/release/1.11 2025-12-04T09:17:11.7815322Z * [new branch] orig/release/1.12 -> origin/orig/release/1.12 2025-12-04T09:17:11.7817372Z * [new branch] orig/release/1.13 -> origin/orig/release/1.13 2025-12-04T09:17:11.7819312Z * [new branch] orig/release/1.6 -> origin/orig/release/1.6 
2025-12-04T09:17:11.7821248Z * [new branch] orig/release/1.7 -> origin/orig/release/1.7 2025-12-04T09:17:11.7823168Z * [new branch] orig/release/1.8 -> origin/orig/release/1.8 2025-12-04T09:17:11.7825120Z * [new branch] orig/release/1.9 -> origin/orig/release/1.9 2025-12-04T09:17:11.7827045Z * [new branch] orig/release/2.0 -> origin/orig/release/2.0 2025-12-04T09:17:11.7828930Z * [new branch] orig/release/2.1 -> origin/orig/release/2.1 2025-12-04T09:17:11.7830849Z * [new branch] orig/release/2.2 -> origin/orig/release/2.2 2025-12-04T09:17:11.7832642Z * [new branch] orig/release/2.3 -> origin/orig/release/2.3 2025-12-04T09:17:11.7834421Z * [new branch] orig/release/2.4 -> origin/orig/release/2.4 2025-12-04T09:17:11.7836255Z * [new branch] orig/release/2.5 -> origin/orig/release/2.5 2025-12-04T09:17:11.7838095Z * [new branch] orig/release/2.6 -> origin/orig/release/2.6 2025-12-04T09:17:11.7840629Z * [new branch] orig/release/2.7 -> origin/orig/release/2.7 2025-12-04T09:17:11.7843056Z * [new branch] orig/release/2.8 -> origin/orig/release/2.8 2025-12-04T09:17:11.7844979Z * [new branch] orig/release/2.9 -> origin/orig/release/2.9 2025-12-04T09:17:11.7849622Z * [new branch] origin/gh/fxdawnn/1/base -> origin/origin/gh/fxdawnn/1/base 2025-12-04T09:17:11.7851360Z * [new branch] origin/gh/fxdawnn/1/orig -> origin/origin/gh/fxdawnn/1/orig 2025-12-04T09:17:11.7854431Z * [new branch] origin/gh/zpcore/14/orig -> origin/origin/gh/zpcore/14/orig 2025-12-04T09:17:11.7856455Z * [new branch] oulgen-patch-1 -> origin/oulgen-patch-1 2025-12-04T09:17:11.7858931Z * [new branch] oulgen-patch-2 -> origin/oulgen-patch-2 2025-12-04T09:17:11.7860569Z * [new branch] oulgen-patch-3 -> origin/oulgen-patch-3 2025-12-04T09:17:11.7862584Z * [new branch] oulgen-patch-4 -> origin/oulgen-patch-4 2025-12-04T09:17:11.7864489Z * [new branch] padded-tensor -> origin/padded-tensor 2025-12-04T09:17:11.7866492Z * [new branch] pca2 -> origin/pca2 2025-12-04T09:17:11.7868645Z * [new branch] 
per_channel_backup -> origin/per_channel_backup 2025-12-04T09:17:11.7870815Z * [new branch] perf_ops -> origin/perf_ops 2025-12-04T09:17:11.7872592Z * [new branch] perf_ops_2_9 -> origin/perf_ops_2_9 2025-12-04T09:17:11.7874763Z * [new branch] pianpwk-patch-1 -> origin/pianpwk-patch-1 2025-12-04T09:17:11.7877292Z * [new branch] pianpwk/__draft_debug_mode -> origin/pianpwk/__draft_debug_mode 2025-12-04T09:17:11.7879148Z * [new branch] pianpwk/_debug_mode_for_triton_draft -> origin/pianpwk/_debug_mode_for_triton_draft 2025-12-04T09:17:11.7881086Z * [new branch] pianpwk/_debug_nn_module_compile -> origin/pianpwk/_debug_nn_module_compile 2025-12-04T09:17:11.7882988Z * [new branch] pianpwk/_draft_triton_11_3 -> origin/pianpwk/_draft_triton_11_3 2025-12-04T09:17:11.7884990Z * [new branch] pianpwk/_manual_bucket_draft -> origin/pianpwk/_manual_bucket_draft 2025-12-04T09:17:11.7887219Z * [new branch] pianpwk/_profile_w_dispatch_keys -> origin/pianpwk/_profile_w_dispatch_keys 2025-12-04T09:17:11.7889460Z * [new branch] pianpwk/_super_draft_debug_mode -> origin/pianpwk/_super_draft_debug_mode 2025-12-04T09:17:11.7891402Z * [new branch] pianpwk/_unbacked_local_shard_size -> origin/pianpwk/_unbacked_local_shard_size 2025-12-04T09:17:11.7893137Z * [new branch] pianpwk/anomaly_tb -> origin/pianpwk/anomaly_tb 2025-12-04T09:17:11.7894968Z * [new branch] pianpwk/auto_fx_annotate -> origin/pianpwk/auto_fx_annotate 2025-12-04T09:17:11.7896895Z * [new branch] pianpwk/backed_size_oblivious_export -> origin/pianpwk/backed_size_oblivious_export 2025-12-04T09:17:11.7898855Z * [new branch] pianpwk/bert_dynamic_perf -> origin/pianpwk/bert_dynamic_perf 2025-12-04T09:17:11.7900999Z * [new branch] pianpwk/debug_fwd_stack_traces -> origin/pianpwk/debug_fwd_stack_traces 2025-12-04T09:17:11.7904153Z * [new branch] pianpwk/debug_hash_tensor -> origin/pianpwk/debug_hash_tensor 2025-12-04T09:17:11.7906082Z * [new branch] pianpwk/debug_mode_annotate -> origin/pianpwk/debug_mode_annotate 
2025-12-04T09:17:11.7907815Z * [new branch] pianpwk/debug_mode_defaults -> origin/pianpwk/debug_mode_defaults 2025-12-04T09:17:11.7909782Z * [new branch] pianpwk/debug_mode_hacks -> origin/pianpwk/debug_mode_hacks 2025-12-04T09:17:11.7911593Z * [new branch] pianpwk/debug_mode_opcall_refactor -> origin/pianpwk/debug_mode_opcall_refactor 2025-12-04T09:17:11.7913378Z * [new branch] pianpwk/debug_mode_show_ids -> origin/pianpwk/debug_mode_show_ids 2025-12-04T09:17:11.7915208Z * [new branch] pianpwk/debug_mode_triton -> origin/pianpwk/debug_mode_triton 2025-12-04T09:17:11.7917147Z * [new branch] pianpwk/debug_show_stack_trace -> origin/pianpwk/debug_show_stack_trace 2025-12-04T09:17:11.7919098Z * [new branch] pianpwk/debug_wait_on_collective -> origin/pianpwk/debug_wait_on_collective 2025-12-04T09:17:11.7921359Z * [new branch] pianpwk/debugmode_compile_tf -> origin/pianpwk/debugmode_compile_tf 2025-12-04T09:17:11.7923271Z * [new branch] pianpwk/dispatch_key_debugging_for_debug -> origin/pianpwk/dispatch_key_debugging_for_debug 2025-12-04T09:17:11.7925017Z * [new branch] pianpwk/draft_debug_mode_tfcompile -> origin/pianpwk/draft_debug_mode_tfcompile 2025-12-04T09:17:11.7926875Z * [new branch] pianpwk/draft_multikernel_nn -> origin/pianpwk/draft_multikernel_nn 2025-12-04T09:17:11.7928849Z * [new branch] pianpwk/draft_multikernel_status_10_5 -> origin/pianpwk/draft_multikernel_status_10_5 2025-12-04T09:17:11.7930915Z * [new branch] pianpwk/dtensor_custom_chunk -> origin/pianpwk/dtensor_custom_chunk 2025-12-04T09:17:11.7932840Z * [new branch] pianpwk/dtensor_unbacked_keypath -> origin/pianpwk/dtensor_unbacked_keypath 2025-12-04T09:17:11.7934835Z * [new branch] pianpwk/event_list_tree -> origin/pianpwk/event_list_tree 2025-12-04T09:17:11.7936682Z * [new branch] pianpwk/false_numel_refs -> origin/pianpwk/false_numel_refs 2025-12-04T09:17:11.7938556Z * [new branch] pianpwk/maybe_guard_rel -> origin/pianpwk/maybe_guard_rel 2025-12-04T09:17:11.7940498Z * [new branch] 
pianpwk/multikernel_hints_draft -> origin/pianpwk/multikernel_hints_draft 2025-12-04T09:17:11.7942405Z * [new branch] pianpwk/no_size_oblivious_slice_scat -> origin/pianpwk/no_size_oblivious_slice_scat 2025-12-04T09:17:11.7944346Z * [new branch] pianpwk/oblivious_reshape_view_better -> origin/pianpwk/oblivious_reshape_view_better 2025-12-04T09:17:11.7946062Z * [new branch] pianpwk/pre_forward_hook -> origin/pianpwk/pre_forward_hook 2025-12-04T09:17:11.7947931Z * [new branch] pianpwk/skip_python_keys_alternate -> origin/pianpwk/skip_python_keys_alternate 2025-12-04T09:17:11.7949871Z * [new branch] pianpwk/skip_python_keys_in_guards -> origin/pianpwk/skip_python_keys_in_guards 2025-12-04T09:17:11.7951625Z * [new branch] pianpwk/sym_tokens_draft -> origin/pianpwk/sym_tokens_draft 2025-12-04T09:17:11.7953475Z * [new branch] pianpwk/symint_one_hot -> origin/pianpwk/symint_one_hot 2025-12-04T09:17:11.7955538Z * [new branch] pianpwk/test_pointwise_guard_or_false -> origin/pianpwk/test_pointwise_guard_or_false 2025-12-04T09:17:11.7957306Z * [new branch] pianpwk/totally_draft_sym_wrap -> origin/pianpwk/totally_draft_sym_wrap 2025-12-04T09:17:11.7959131Z * [new branch] pianpwk/try_dumb_stuff -> origin/pianpwk/try_dumb_stuff 2025-12-04T09:17:11.7961218Z * [new branch] pianpwk/try_dumb_stuff_2 -> origin/pianpwk/try_dumb_stuff_2 2025-12-04T09:17:11.7963132Z * [new branch] pianpwk/unbacked_dtensor_mm -> origin/pianpwk/unbacked_dtensor_mm 2025-12-04T09:17:11.7964990Z * [new branch] pianpwk/unbacked_tracing_12_2 -> origin/pianpwk/unbacked_tracing_12_2 2025-12-04T09:17:11.7966802Z * [new branch] pianpwk/user_symints -> origin/pianpwk/user_symints 2025-12-04T09:17:11.7968707Z * [new branch] pianpwk/wan21_reshape -> origin/pianpwk/wan21_reshape 2025-12-04T09:17:11.7971248Z * [new branch] piz/fix_partial_backward_1112 -> origin/piz/fix_partial_backward_1112 2025-12-04T09:17:11.7972918Z * [new branch] piz/prop_cache_clean -> origin/piz/prop_cache_clean 2025-12-04T09:17:11.7974866Z * 
[new branch] pool-separate -> origin/pool-separate 2025-12-04T09:17:11.7976851Z * [new branch] pr-156087 -> origin/pr-156087 2025-12-04T09:17:11.7979486Z * [new branch] pr/131860 -> origin/pr/131860 2025-12-04T09:17:11.7981456Z * [new branch] predispatch_to -> origin/predispatch_to 2025-12-04T09:17:11.7983835Z * [new branch] protect-c17 -> origin/protect-c17 2025-12-04T09:17:11.7985839Z * [new branch] pt-opt-cuda3 -> origin/pt-opt-cuda3 2025-12-04T09:17:11.7988384Z * [new branch] python_compiled_autograd -> origin/python_compiled_autograd 2025-12-04T09:17:11.7991198Z * [new branch] q1l1/fix_device_moved_constant_type_unknown -> origin/q1l1/fix_device_moved_constant_type_unknown 2025-12-04T09:17:11.7992996Z * [new branch] q1l1/fix_wrong_default_type_for_kernel_call_args -> origin/q1l1/fix_wrong_default_type_for_kernel_call_args 2025-12-04T09:17:11.7995788Z * [new branch] qchip/export-D54134695 -> origin/qchip/export-D54134695 2025-12-04T09:17:11.7997843Z * [new branch] quote-pytest_cache -> origin/quote-pytest_cache 2025-12-04T09:17:11.8000444Z * [new branch] reland-accgrad-stream-warn -> origin/reland-accgrad-stream-warn 2025-12-04T09:17:11.8003284Z * [new branch] release/1.10 -> origin/release/1.10 2025-12-04T09:17:11.8004945Z * [new branch] release/1.11 -> origin/release/1.11 2025-12-04T09:17:11.8006804Z * [new branch] release/1.12 -> origin/release/1.12 2025-12-04T09:17:11.8008894Z * [new branch] release/1.13 -> origin/release/1.13 2025-12-04T09:17:11.8010481Z * [new branch] release/1.4 -> origin/release/1.4 2025-12-04T09:17:11.8012561Z * [new branch] release/1.4.1 -> origin/release/1.4.1 2025-12-04T09:17:11.8014408Z * [new branch] release/1.5 -> origin/release/1.5 2025-12-04T09:17:11.8016399Z * [new branch] release/1.6 -> origin/release/1.6 2025-12-04T09:17:11.8018452Z * [new branch] release/1.7 -> origin/release/1.7 2025-12-04T09:17:11.8020708Z * [new branch] release/1.8 -> origin/release/1.8 2025-12-04T09:17:11.8022560Z * [new branch] release/1.9 -> 
origin/release/1.9 2025-12-04T09:17:11.8024435Z * [new branch] release/2.0 -> origin/release/2.0 2025-12-04T09:17:11.8026323Z * [new branch] release/2.1 -> origin/release/2.1 2025-12-04T09:17:11.8028202Z * [new branch] release/2.2 -> origin/release/2.2 2025-12-04T09:17:11.8030362Z * [new branch] release/2.3 -> origin/release/2.3 2025-12-04T09:17:11.8032752Z * [new branch] release/2.4 -> origin/release/2.4 2025-12-04T09:17:11.8035080Z * [new branch] release/2.5 -> origin/release/2.5 2025-12-04T09:17:11.8037189Z * [new branch] release/2.6 -> origin/release/2.6 2025-12-04T09:17:11.8039144Z * [new branch] release/2.7 -> origin/release/2.7 2025-12-04T09:17:11.8041355Z * [new branch] release/2.8 -> origin/release/2.8 2025-12-04T09:17:11.8043265Z * [new branch] release/2.9 -> origin/release/2.9 2025-12-04T09:17:11.8045235Z * [new branch] release_notes -> origin/release_notes 2025-12-04T09:17:11.8047233Z * [new branch] remove_pyinterpreter -> origin/remove_pyinterpreter 2025-12-04T09:17:11.8049471Z * [new branch] replace-pytorch-labs-20250812-195836 -> origin/replace-pytorch-labs-20250812-195836 2025-12-04T09:17:11.8051286Z * [new branch] replace-pytorch-labs-20250812-200248 -> origin/replace-pytorch-labs-20250812-200248 2025-12-04T09:17:11.8053029Z * [new branch] replace-pytorch-labs-20250812-200324 -> origin/replace-pytorch-labs-20250812-200324 2025-12-04T09:17:11.8054942Z * [new branch] replace-pytorch-labs-20250812-204020 -> origin/replace-pytorch-labs-20250812-204020 2025-12-04T09:17:11.8059193Z * [new branch] revert-131069-gh/krzysztofjordan/1/head -> origin/revert-131069-gh/krzysztofjordan/1/head 2025-12-04T09:17:11.8063541Z * [new branch] revert-131469-gh/andrewor14/51/head -> origin/revert-131469-gh/andrewor14/51/head 2025-12-04T09:17:11.8067160Z * [new branch] revert-152361-gh/fadara01/1/head -> origin/revert-152361-gh/fadara01/1/head 2025-12-04T09:17:11.8071000Z * [new branch] revert-156870-gh/skarjala/3/head -> origin/revert-156870-gh/skarjala/3/head 
2025-12-04T09:17:11.8073179Z * [new branch] revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ -> origin/revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ 2025-12-04T09:17:11.8074945Z * [new branch] revert-hoo-invoke-subgraph -> origin/revert-hoo-invoke-subgraph 2025-12-04T09:17:11.8076917Z * [new branch] revert_always_build_distributed -> origin/revert_always_build_distributed 2025-12-04T09:17:11.8078890Z * [new branch] rms_norm_patch -> origin/rms_norm_patch 2025-12-04T09:17:11.8081807Z * [new branch] ruisi/fix_all_to_all_estimation -> origin/ruisi/fix_all_to_all_estimation 2025-12-04T09:17:11.8083408Z * [new branch] ruisi/fix_comm_estimation -> origin/ruisi/fix_comm_estimation 2025-12-04T09:17:11.8085178Z * [new branch] ruisi/fix_dynamic_shape_estimation -> origin/ruisi/fix_dynamic_shape_estimation 2025-12-04T09:17:11.8086942Z * [new branch] ruisi/fix_llama3_autobucketing -> origin/ruisi/fix_llama3_autobucketing 2025-12-04T09:17:11.8089022Z * [new branch] ruisi/fix_manual_bucketing_ep_pass -> origin/ruisi/fix_manual_bucketing_ep_pass 2025-12-04T09:17:11.8091127Z * [new branch] ruisi/manual_bucket_pass -> origin/ruisi/manual_bucket_pass 2025-12-04T09:17:11.8093973Z * [new branch] ryanguo99/cleanup-dynamo-expected-failures -> origin/ryanguo99/cleanup-dynamo-expected-failures 2025-12-04T09:17:11.8095689Z * [new branch] ryanguo99/fix-closure-var -> origin/ryanguo99/fix-closure-var 2025-12-04T09:17:11.8098229Z * [new branch] rzou/faketensor_bench -> origin/rzou/faketensor_bench 2025-12-04T09:17:11.8099945Z * [new branch] rzou/njt -> origin/rzou/njt 2025-12-04T09:17:11.8101980Z * [new branch] rzou/pca -> origin/rzou/pca 2025-12-04T09:17:11.8103719Z * [new branch] rzou/realprop -> origin/rzou/realprop 2025-12-04T09:17:11.8105633Z * [new branch] samplevllm -> origin/samplevllm 2025-12-04T09:17:11.8108554Z * [new branch] sanchitintel/weird_thing_with_test_cpu_select_algorithm -> origin/sanchitintel/weird_thing_with_test_cpu_select_algorithm 
2025-12-04T09:17:11.8110495Z * [new branch] sapling-pr-archive-SS-JIA -> origin/sapling-pr-archive-SS-JIA 2025-12-04T09:17:11.8112479Z * [new branch] sapling-pr-archive-tushar00jain -> origin/sapling-pr-archive-tushar00jain 2025-12-04T09:17:11.8114260Z * [new branch] save -> origin/save 2025-12-04T09:17:11.8116207Z * [new branch] scaled_mm -> origin/scaled_mm 2025-12-04T09:17:11.8118157Z * [new branch] scan_attempt -> origin/scan_attempt 2025-12-04T09:17:11.8120801Z * [new branch] sdym/2.5.1 -> origin/sdym/2.5.1 2025-12-04T09:17:11.8122905Z * [new branch] sekyondaMeta-dynamoconfig-fix -> origin/sekyondaMeta-dynamoconfig-fix 2025-12-04T09:17:11.8125410Z * [new branch] shengf/fx-xform-perf -> origin/shengf/fx-xform-perf 2025-12-04T09:17:11.8127407Z * [new branch] shoumikhin-patch-1 -> origin/shoumikhin-patch-1 2025-12-04T09:17:11.8129361Z * [new branch] solve-accuracy-fix -> origin/solve-accuracy-fix 2025-12-04T09:17:11.8131265Z * [new branch] some_rocm_inductor_skips -> origin/some_rocm_inductor_skips 2025-12-04T09:17:11.8133922Z * [new branch] soulitzer/stash-tls-ac -> origin/soulitzer/stash-tls-ac 2025-12-04T09:17:11.8135908Z * [new branch] sparse-mm-bf16-support -> origin/sparse-mm-bf16-support 2025-12-04T09:17:11.8137954Z * [new branch] starterTaskUpdate -> origin/starterTaskUpdate 2025-12-04T09:17:11.8139906Z * [new branch] suo -> origin/suo 2025-12-04T09:17:11.8141843Z * [new branch] sve-poc -> origin/sve-poc 2025-12-04T09:17:11.8143843Z * [new branch] switch-bn -> origin/switch-bn 2025-12-04T09:17:11.8145887Z * [new branch] sy_annotation_in_autograd_hop -> origin/sy_annotation_in_autograd_hop 2025-12-04T09:17:11.8147770Z * [new branch] sy_aot_eager_record -> origin/sy_aot_eager_record 2025-12-04T09:17:11.8150252Z * [new branch] sy_custom_bucketing -> origin/sy_custom_bucketing 2025-12-04T09:17:11.8152547Z * [new branch] sy_debug_mode_test -> origin/sy_debug_mode_test 2025-12-04T09:17:11.8154304Z * [new branch] sy_deserialize -> origin/sy_deserialize 
2025-12-04T09:17:11.8156191Z * [new branch] sy_dump_gm_code -> origin/sy_dump_gm_code 2025-12-04T09:17:11.8158125Z * [new branch] sy_exp -> origin/sy_exp 2025-12-04T09:17:11.8160204Z * [new branch] sy_export_annotation -> origin/sy_export_annotation 2025-12-04T09:17:11.8162130Z * [new branch] sy_invoke_subgraph -> origin/sy_invoke_subgraph 2025-12-04T09:17:11.8164062Z * [new branch] sy_kernel_bw_name -> origin/sy_kernel_bw_name 2025-12-04T09:17:11.8165976Z * [new branch] sy_multi_arch -> origin/sy_multi_arch 2025-12-04T09:17:11.8168010Z * [new branch] sy_nn_module_stack -> origin/sy_nn_module_stack 2025-12-04T09:17:11.8169877Z * [new branch] sy_original_dtensor -> origin/sy_original_dtensor 2025-12-04T09:17:11.8171804Z * [new branch] sy_profiler_cia -> origin/sy_profiler_cia 2025-12-04T09:17:11.8173698Z * [new branch] symm_mem_sync -> origin/symm_mem_sync 2025-12-04T09:17:11.8175689Z * [new branch] sympy-bottleneck-repro -> origin/sympy-bottleneck-repro 2025-12-04T09:17:11.8179271Z * [new branch] tensordict_integration -> origin/tensordict_integration 2025-12-04T09:17:11.8179916Z * [new branch] test-move-conda-builds -> origin/test-move-conda-builds 2025-12-04T09:17:11.8181679Z * [new branch] test-old -> origin/test-old 2025-12-04T09:17:11.8184128Z * [new branch] test/bmm_heur -> origin/test/bmm_heur 2025-12-04T09:17:11.8186669Z * [new branch] tianren/customOp_autotune_fix -> origin/tianren/customOp_autotune_fix 2025-12-04T09:17:11.8188557Z * [new branch] tianren/customOp_enable_max_autotune -> origin/tianren/customOp_enable_max_autotune 2025-12-04T09:17:11.8190239Z * [new branch] tianren/customOp_fusion -> origin/tianren/customOp_fusion 2025-12-04T09:17:11.8192128Z * [new branch] tianren/customop_collectiveop_benchmark -> origin/tianren/customop_collectiveop_benchmark 2025-12-04T09:17:11.8194219Z * [new branch] tianren/customop_collectiveop_benchmark_fix -> origin/tianren/customop_collectiveop_benchmark_fix 2025-12-04T09:17:11.8196795Z * [new branch] 
tianren/customop_dynamic_config -> origin/tianren/customop_dynamic_config 2025-12-04T09:17:11.8198685Z * [new branch] tianren/dynamic_range_input -> origin/tianren/dynamic_range_input 2025-12-04T09:17:11.8201077Z * [new branch] tianren/dynamic_range_input_fix -> origin/tianren/dynamic_range_input_fix 2025-12-04T09:17:11.8205430Z * [new branch] tianren/dynamic_range_input_merge -> origin/tianren/dynamic_range_input_merge 2025-12-04T09:17:11.8207264Z * [new branch] tianren/flex_paged_attn_fix_temp -> origin/tianren/flex_paged_attn_fix_temp 2025-12-04T09:17:11.8209149Z * [new branch] tianren/fx_codegen_dump -> origin/tianren/fx_codegen_dump 2025-12-04T09:17:11.8211019Z * [new branch] tianren/symmetric_memory -> origin/tianren/symmetric_memory 2025-12-04T09:17:11.8212823Z * [new branch] tianren/test -> origin/tianren/test 2025-12-04T09:17:11.8214892Z * [new branch] tidy_performance_cyy -> origin/tidy_performance_cyy 2025-12-04T09:17:11.8216773Z * [new branch] tmp -> origin/tmp 2025-12-04T09:17:11.8218768Z * [new branch] torchtitan_ep -> origin/torchtitan_ep 2025-12-04T09:17:11.8220751Z * [new branch] torchtitan_integration -> origin/torchtitan_integration 2025-12-04T09:17:11.8222847Z * [new branch] trace_fsdp_torchtune_lora -> origin/trace_fsdp_torchtune_lora 2025-12-04T09:17:11.8224681Z * [new branch] traceable_fsdp_unit_tests -> origin/traceable_fsdp_unit_tests 2025-12-04T09:17:11.8226596Z * [new branch] tree_loop_vec_base -> origin/tree_loop_vec_base 2025-12-04T09:17:11.8228593Z * [new branch] triton_kernel -> origin/triton_kernel 2025-12-04T09:17:11.8230494Z * [new branch] tt_pkg_1908 -> origin/tt_pkg_1908 2025-12-04T09:17:11.8232477Z * [new branch] type_dec -> origin/type_dec 2025-12-04T09:17:11.8234426Z * [new branch] udate-sphinx-dependancies -> origin/udate-sphinx-dependancies 2025-12-04T09:17:11.8241889Z * [new branch] update-audio-commit-hash/17630256502-1803-1 -> origin/update-audio-commit-hash/17630256502-1803-1 2025-12-04T09:17:11.8242533Z * [new branch] 
update-audio-commit-hash/19087141161-1916-1 -> origin/update-audio-commit-hash/19087141161-1916-1 2025-12-04T09:17:11.8243013Z * [new branch] update-audio-commit-hash/19250643381-1929-1 -> origin/update-audio-commit-hash/19250643381-1929-1 2025-12-04T09:17:11.8244148Z * [new branch] update-audio-commit-hash/19397724337-1935-1 -> origin/update-audio-commit-hash/19397724337-1935-1 2025-12-04T09:17:11.8244506Z * [new branch] update-audio-commit-hash/19555670148-1941-1 -> origin/update-audio-commit-hash/19555670148-1941-1 2025-12-04T09:17:11.8246848Z * [new branch] update-audio-commit-hash/19750627930-1946-1 -> origin/update-audio-commit-hash/19750627930-1946-1 2025-12-04T09:17:11.8249225Z * [new branch] update-triton-commit-hash/13663274526-1487-2 -> origin/update-triton-commit-hash/13663274526-1487-2 2025-12-04T09:17:11.8251621Z * [new branch] update-vision-commit-hash/19087141161-1916-1 -> origin/update-vision-commit-hash/19087141161-1916-1 2025-12-04T09:17:11.8253448Z * [new branch] update-vision-commit-hash/19184897099-1925-1 -> origin/update-vision-commit-hash/19184897099-1925-1 2025-12-04T09:17:11.8255092Z * [new branch] update-vision-commit-hash/19250643381-1929-1 -> origin/update-vision-commit-hash/19250643381-1929-1 2025-12-04T09:17:11.8257416Z * [new branch] update-vision-commit-hash/19381328640-1934-1 -> origin/update-vision-commit-hash/19381328640-1934-1 2025-12-04T09:17:11.8259190Z * [new branch] update-vision-commit-hash/19485237164-1938-1 -> origin/update-vision-commit-hash/19485237164-1938-1 2025-12-04T09:17:11.8261811Z * [new branch] update-vllm-commit-hash/18451675449-1879-1 -> origin/update-vllm-commit-hash/18451675449-1879-1 2025-12-04T09:17:11.8264108Z * [new branch] update-vllm-dockerfile -> origin/update-vllm-dockerfile 2025-12-04T09:17:11.8266838Z * [new branch] update-xla-commit-hash/19224287370-211-1 -> origin/update-xla-commit-hash/19224287370-211-1 2025-12-04T09:17:11.8268801Z * [new branch] update-xla-commit-hash/19422028566-212-1 -> 
origin/update-xla-commit-hash/19422028566-212-1 2025-12-04T09:17:11.8270525Z * [new branch] update-xla-commit-hash/19626841311-213-1 -> origin/update-xla-commit-hash/19626841311-213-1 2025-12-04T09:17:11.8272504Z * [new branch] update_docs_torch_multinomial_issue#125388 -> origin/update_docs_torch_multinomial_issue#125388 2025-12-04T09:17:11.8274268Z * [new branch] update_operator_readme -> origin/update_operator_readme 2025-12-04T09:17:11.8276313Z * [new branch] update_slow_tests_1722488736 -> origin/update_slow_tests_1722488736 2025-12-04T09:17:11.8278251Z * [new branch] update_slow_tests_1722879173 -> origin/update_slow_tests_1722879173 2025-12-04T09:17:11.8280429Z * [new branch] update_slow_tests_1762155677 -> origin/update_slow_tests_1762155677 2025-12-04T09:17:11.8282435Z * [new branch] update_slow_tests_1763365283 -> origin/update_slow_tests_1763365283 2025-12-04T09:17:11.8284325Z * [new branch] update_submodule_FBGEMM -> origin/update_submodule_FBGEMM 2025-12-04T09:17:11.8286214Z * [new branch] update_submodule_kineto -> origin/update_submodule_kineto 2025-12-04T09:17:11.8288199Z * [new branch] update_submodule_tensorpipe -> origin/update_submodule_tensorpipe 2025-12-04T09:17:11.8290199Z * [new branch] upload-tests-for-autorevert -> origin/upload-tests-for-autorevert 2025-12-04T09:17:11.8292108Z * [new branch] v0.1.2 -> origin/v0.1.2 2025-12-04T09:17:11.8294217Z * [new branch] v1.0.1 -> origin/v1.0.1 2025-12-04T09:17:11.8296265Z * [new branch] v1.0.3 -> origin/v1.0.3 2025-12-04T09:17:11.8298592Z * [new branch] v1.1.0 -> origin/v1.1.0 2025-12-04T09:17:11.8300630Z * [new branch] v1.2.0 -> origin/v1.2.0 2025-12-04T09:17:11.8302982Z * [new branch] v1.3.0 -> origin/v1.3.0 2025-12-04T09:17:11.8305042Z * [new branch] v1.3.1 -> origin/v1.3.1 2025-12-04T09:17:11.8307027Z * [new branch] validate_fn -> origin/validate_fn 2025-12-04T09:17:11.8309237Z * [new branch] validations_2.6 -> origin/validations_2.6 2025-12-04T09:17:11.8311259Z * [new branch] validations_2.8 -> 
origin/validations_2.8 2025-12-04T09:17:11.8313193Z * [new branch] varlen-api -> origin/varlen-api 2025-12-04T09:17:11.8315165Z * [new branch] varlen-api-backup -> origin/varlen-api-backup 2025-12-04T09:17:11.8317110Z * [new branch] varlen_batch_invariance -> origin/varlen_batch_invariance 2025-12-04T09:17:11.8320020Z * [new branch] viable/strict -> origin/viable/strict 2025-12-04T09:17:11.8322809Z * [new branch] vishal9-team/dtensor_parallelism_toy -> origin/vishal9-team/dtensor_parallelism_toy 2025-12-04T09:17:11.8324592Z * [new branch] vllmbuildci -> origin/vllmbuildci 2025-12-04T09:17:11.8326555Z * [new branch] vllmpin -> origin/vllmpin 2025-12-04T09:17:11.8328831Z * [new branch] vscode-recommend-pyrefly -> origin/vscode-recommend-pyrefly 2025-12-04T09:17:11.8330749Z * [new branch] wdvr-patch-1 -> origin/wdvr-patch-1 2025-12-04T09:17:11.8333282Z * [new branch] wdvr/iss_145259 -> origin/wdvr/iss_145259 2025-12-04T09:17:11.8335792Z * [new branch] whc/pei -> origin/whc/pei 2025-12-04T09:17:11.8337557Z * [new branch] whc/pp_fix -> origin/whc/pp_fix 2025-12-04T09:17:11.8339534Z * [new branch] whc/sharding -> origin/whc/sharding 2025-12-04T09:17:11.8341297Z * [new branch] whc/sharding2 -> origin/whc/sharding2 2025-12-04T09:17:11.8342994Z * [new branch] whc/uneven -> origin/whc/uneven 2025-12-04T09:17:11.8345399Z * [new branch] whc/uneven-merge -> origin/whc/uneven-merge 2025-12-04T09:17:11.8347371Z * [new branch] win_warnings -> origin/win_warnings 2025-12-04T09:17:11.8349267Z * [new branch] windows_libtorch_free -> origin/windows_libtorch_free 2025-12-04T09:17:11.8351150Z * [new branch] xmfan-war -> origin/xmfan-war 2025-12-04T09:17:11.8354167Z * [new branch] xmfan/ca_0516 -> origin/xmfan/ca_0516 2025-12-04T09:17:11.8356023Z * [new branch] xmfan/ca_1051b93192 -> origin/xmfan/ca_1051b93192 2025-12-04T09:17:11.8357989Z * [new branch] xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 -> origin/xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 
2025-12-04T09:17:11.8359602Z * [new branch] xmfan/ca_5a2be192d1 -> origin/xmfan/ca_5a2be192d1 2025-12-04T09:17:11.8361441Z * [new branch] xmfan/ca_9d59b516e9 -> origin/xmfan/ca_9d59b516e9 2025-12-04T09:17:11.8363116Z * [new branch] xmfan/ca_apr8 -> origin/xmfan/ca_apr8 2025-12-04T09:17:11.8364859Z * [new branch] xmfan/ca_base -> origin/xmfan/ca_base 2025-12-04T09:17:11.8366877Z * [new branch] xmfan/ca_dynamic -> origin/xmfan/ca_dynamic 2025-12-04T09:17:11.8369163Z * [new branch] xmfan/ca_fix_dyn -> origin/xmfan/ca_fix_dyn 2025-12-04T09:17:11.8371003Z * [new branch] xmfan/ca_fix_lowering -> origin/xmfan/ca_fix_lowering 2025-12-04T09:17:11.8372807Z * [new branch] xmfan/ca_fix_polyfills -> origin/xmfan/ca_fix_polyfills 2025-12-04T09:17:11.8374489Z * [new branch] xmfan/ca_jan3 -> origin/xmfan/ca_jan3 2025-12-04T09:17:11.8376384Z * [new branch] xmfan/ca_jun18 -> origin/xmfan/ca_jun18 2025-12-04T09:17:11.8378276Z * [new branch] xmfan/ca_jun24 -> origin/xmfan/ca_jun24 2025-12-04T09:17:11.8380135Z * [new branch] xmfan/ca_nested -> origin/xmfan/ca_nested 2025-12-04T09:17:11.8381966Z * [new branch] xmfan/ca_overhead -> origin/xmfan/ca_overhead 2025-12-04T09:17:11.8383969Z * [new branch] xmfan/ca_overhead_0eba7e5451 -> origin/xmfan/ca_overhead_0eba7e5451 2025-12-04T09:17:11.8385717Z * [new branch] xmfan/cacu_jun18 -> origin/xmfan/cacu_jun18 2025-12-04T09:17:11.8387527Z * [new branch] xmfan/cacu_jun19 -> origin/xmfan/cacu_jun19 2025-12-04T09:17:11.8389494Z * [new branch] xmfan/cacu_jun4 -> origin/xmfan/cacu_jun4 2025-12-04T09:17:11.8391383Z * [new branch] xmfan/disable_duck_shape -> origin/xmfan/disable_duck_shape 2025-12-04T09:17:11.8393251Z * [new branch] xmfan/fca_cpp_node_passthrough -> origin/xmfan/fca_cpp_node_passthrough 2025-12-04T09:17:11.8395215Z * [new branch] xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T09:17:11.8397041Z * [new branch] xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 
-> origin/xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T09:17:11.8398791Z * [new branch] xmfan/single_step -> origin/xmfan/single_step 2025-12-04T09:17:11.8400667Z * [new branch] xmfan/sth_0829 -> origin/xmfan/sth_0829 2025-12-04T09:17:11.8402773Z * [new branch] xmfan/test -> origin/xmfan/test 2025-12-04T09:17:11.8405325Z * [new branch] yguo/debug-0226-constexpr -> origin/yguo/debug-0226-constexpr 2025-12-04T09:17:11.8407060Z * [new branch] yguo/new_latest_changes -> origin/yguo/new_latest_changes 2025-12-04T09:17:11.8408779Z * [new branch] yguo/patch_constexpr_changes -> origin/yguo/patch_constexpr_changes 2025-12-04T09:17:11.8411227Z * [new branch] yiming/bootcamp -> origin/yiming/bootcamp 2025-12-04T09:17:11.8413558Z * [new branch] yiming/run_with_start_end_rng_hop -> origin/yiming/run_with_start_end_rng_hop 2025-12-04T09:17:11.8415485Z * [new branch] yolo-llama3 -> origin/yolo-llama3 2025-12-04T09:17:11.8418119Z * [new branch] zainr/canary-test -> origin/zainr/canary-test 2025-12-04T09:17:11.8420124Z * [new branch] zainr/cleanup-gh-runners -> origin/zainr/cleanup-gh-runners 2025-12-04T09:17:11.8421796Z * [new branch] zainr/pull-migration-c -> origin/zainr/pull-migration-c 2025-12-04T09:17:11.8423515Z * [new branch] zainr/test2 -> origin/zainr/test2 2025-12-04T09:17:11.8425729Z * [new branch] zasdfgbnm-patch-3 -> origin/zasdfgbnm-patch-3 2025-12-04T09:17:11.8427469Z * [new branch] zb2p -> origin/zb2p 2025-12-04T09:17:11.8429968Z * [new branch] zeros-and-scatter-part2 -> origin/zeros-and-scatter-part2 2025-12-04T09:17:11.8434037Z * [new branch] zhxchen17/ci/vllm_lora_oom -> origin/zhxchen17/ci/vllm_lora_oom 2025-12-04T09:17:11.8435874Z * [new branch] zhxchen17/ci/vllm_multimodal_oom -> origin/zhxchen17/ci/vllm_multimodal_oom 2025-12-04T09:17:11.8437617Z * [new branch] zhxchen17/ci/vllm_pin -> origin/zhxchen17/ci/vllm_pin 2025-12-04T09:17:11.8440360Z * [new branch] zhxchen17/dynamo/unsafe_drop_all_guards -> 
origin/zhxchen17/dynamo/unsafe_drop_all_guards 2025-12-04T09:17:11.8442903Z * [new branch] zhxchen17/export/call_override -> origin/zhxchen17/export/call_override 2025-12-04T09:17:11.8444665Z * [new branch] zhxchen17/export/codemod1 -> origin/zhxchen17/export/codemod1 2025-12-04T09:17:11.8446471Z * [new branch] zhxchen17/export/ctx_return -> origin/zhxchen17/export/ctx_return 2025-12-04T09:17:11.8448416Z * [new branch] zhxchen17/export/disable_side_effect_warn -> origin/zhxchen17/export/disable_side_effect_warn 2025-12-04T09:17:11.8450129Z * [new branch] zhxchen17/export/pytree_check -> origin/zhxchen17/export/pytree_check 2025-12-04T09:17:11.8452568Z * [new branch] zhxchen17/precompile/aoti -> origin/zhxchen17/precompile/aoti 2025-12-04T09:17:11.8454432Z * [new branch] zhxchen17/precompile/globals -> origin/zhxchen17/precompile/globals 2025-12-04T09:17:11.8456253Z * [new branch] zhxchen17/precompile/inductor_guards -> origin/zhxchen17/precompile/inductor_guards 2025-12-04T09:17:11.8458646Z * [new branch] zhxchen17/scratch/0 -> origin/zhxchen17/scratch/0 2025-12-04T09:17:11.8460476Z * [new branch] zhxchen17/torch_export_api_update -> origin/zhxchen17/torch_export_api_update 2025-12-04T09:17:11.8463045Z * [new branch] zhxhcen17/moodycamel -> origin/zhxhcen17/moodycamel 2025-12-04T09:17:11.8465621Z * [new branch] zxiiro/build-times -> origin/zxiiro/build-times 2025-12-04T09:17:11.8467525Z * [new branch] zxiiro/c7i.2xlarge -> origin/zxiiro/c7i.2xlarge 2025-12-04T09:17:11.8469406Z * [new branch] zxiiro/c7i.2xlarge.h100 -> origin/zxiiro/c7i.2xlarge.h100 2025-12-04T09:17:11.8471161Z * [new branch] zxiiro/main -> origin/zxiiro/main 2025-12-04T09:17:11.8472978Z * [new branch] zxiiro/risc64 -> origin/zxiiro/risc64 2025-12-04T09:17:11.8474847Z * [new branch] zxiiro/test-multicloud-arc -> origin/zxiiro/test-multicloud-arc 2025-12-04T09:17:11.8476550Z * [new tag] bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug -> bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug 
2025-12-04T09:17:11.8478096Z * [new tag] ci/binaries/77164 -> ci/binaries/77164 2025-12-04T09:17:11.8479969Z * [new tag] ciflow/b200/115316 -> ciflow/b200/115316 2025-12-04T09:17:11.8481281Z * [new tag] ciflow/b200/160685 -> ciflow/b200/160685 2025-12-04T09:17:11.8482378Z * [new tag] ciflow/b200/161607 -> ciflow/b200/161607 2025-12-04T09:17:11.8483761Z * [new tag] ciflow/b200/161938 -> ciflow/b200/161938 2025-12-04T09:17:11.8485139Z * [new tag] ciflow/b200/167207 -> ciflow/b200/167207 2025-12-04T09:17:11.8486441Z * [new tag] ciflow/b200/167989 -> ciflow/b200/167989 2025-12-04T09:17:11.8487792Z * [new tag] ciflow/b200/168096 -> ciflow/b200/168096 2025-12-04T09:17:11.8489223Z * [new tag] ciflow/b200/168175 -> ciflow/b200/168175 2025-12-04T09:17:11.8490609Z * [new tag] ciflow/b200/168195 -> ciflow/b200/168195 2025-12-04T09:17:11.8491862Z * [new tag] ciflow/b200/169200 -> ciflow/b200/169200 2025-12-04T09:17:11.8493299Z * [new tag] ciflow/b200/169216 -> ciflow/b200/169216 2025-12-04T09:17:11.8495128Z * [new tag] ciflow/b200/169380 -> ciflow/b200/169380 2025-12-04T09:17:11.8497038Z * [new tag] ciflow/b200/169412 -> ciflow/b200/169412 2025-12-04T09:17:11.8498644Z * [new tag] ciflow/b200/169470 -> ciflow/b200/169470 2025-12-04T09:17:11.8499982Z * [new tag] ciflow/b200/169471 -> ciflow/b200/169471 2025-12-04T09:17:11.8501674Z * [new tag] ciflow/b200/169472 -> ciflow/b200/169472 2025-12-04T09:17:11.8503663Z * [new tag] ciflow/b200/169514 -> ciflow/b200/169514 2025-12-04T09:17:11.8505005Z * [new tag] ciflow/b200/169517 -> ciflow/b200/169517 2025-12-04T09:17:11.8506655Z * [new tag] ciflow/binaries/165922 -> ciflow/binaries/165922 2025-12-04T09:17:11.8508502Z * [new tag] ciflow/binaries/169510 -> ciflow/binaries/169510 2025-12-04T09:17:11.8510124Z * [new tag] ciflow/binaries_wheel/157994 -> ciflow/binaries_wheel/157994 2025-12-04T09:17:11.8511492Z * [new tag] ciflow/binaries_wheel/166829 -> ciflow/binaries_wheel/166829 2025-12-04T09:17:11.8512622Z * [new tag] 
ciflow/binaries_wheel/167972 -> ciflow/binaries_wheel/167972 2025-12-04T09:17:11.8514181Z * [new tag] ciflow/binaries_wheel/167981 -> ciflow/binaries_wheel/167981 2025-12-04T09:17:11.8515691Z * [new tag] ciflow/dynamo/167695 -> ciflow/dynamo/167695 2025-12-04T09:17:11.8516910Z * [new tag] ciflow/dynamo/168096 -> ciflow/dynamo/168096 2025-12-04T09:17:11.8518300Z * [new tag] ciflow/dynamo/169525 -> ciflow/dynamo/169525 2025-12-04T09:17:11.8520317Z * [new tag] ciflow/h100-cutlass-backend/161938 -> ciflow/h100-cutlass-backend/161938 2025-12-04T09:17:11.8521208Z * [new tag] ciflow/h100-cutlass-backend/161940 -> ciflow/h100-cutlass-backend/161940 2025-12-04T09:17:11.8522897Z * [new tag] ciflow/h100-distributed/168923 -> ciflow/h100-distributed/168923 2025-12-04T09:17:11.8524341Z * [new tag] ciflow/h100-symm-mem/167552 -> ciflow/h100-symm-mem/167552 2025-12-04T09:17:11.8525584Z * [new tag] ciflow/h100-symm-mem/168129 -> ciflow/h100-symm-mem/168129 2025-12-04T09:17:11.8526827Z * [new tag] ciflow/h100-symm-mem/168917 -> ciflow/h100-symm-mem/168917 2025-12-04T09:17:11.8528443Z * [new tag] ciflow/h100-symm-mem/169156 -> ciflow/h100-symm-mem/169156 2025-12-04T09:17:11.8529362Z * [new tag] ciflow/h100-symm-mem/169200 -> ciflow/h100-symm-mem/169200 2025-12-04T09:17:11.8530841Z * [new tag] ciflow/h100-symm-mem/169216 -> ciflow/h100-symm-mem/169216 2025-12-04T09:17:11.8532094Z * [new tag] ciflow/h100-symm-mem/169338 -> ciflow/h100-symm-mem/169338 2025-12-04T09:17:11.8533424Z * [new tag] ciflow/h100-symm-mem/169355 -> ciflow/h100-symm-mem/169355 2025-12-04T09:17:11.8534652Z * [new tag] ciflow/h100-symm-mem/169543 -> ciflow/h100-symm-mem/169543 2025-12-04T09:17:11.8536141Z * [new tag] ciflow/h100/115316 -> ciflow/h100/115316 2025-12-04T09:17:11.8537427Z * [new tag] ciflow/h100/160685 -> ciflow/h100/160685 2025-12-04T09:17:11.8538897Z * [new tag] ciflow/h100/160729 -> ciflow/h100/160729 2025-12-04T09:17:11.8539743Z * [new tag] ciflow/h100/161607 -> ciflow/h100/161607 
2025-12-04T09:17:11.8541223Z * [new tag] ciflow/h100/161938 -> ciflow/h100/161938 2025-12-04T09:17:11.8542869Z * [new tag] ciflow/h100/167207 -> ciflow/h100/167207 2025-12-04T09:17:11.8544152Z * [new tag] ciflow/h100/167989 -> ciflow/h100/167989 2025-12-04T09:17:11.8545358Z * [new tag] ciflow/h100/168096 -> ciflow/h100/168096 2025-12-04T09:17:11.8546375Z * [new tag] ciflow/h100/168175 -> ciflow/h100/168175 2025-12-04T09:17:11.8547836Z * [new tag] ciflow/h100/168195 -> ciflow/h100/168195 2025-12-04T09:17:11.8549141Z * [new tag] ciflow/h100/168980 -> ciflow/h100/168980 2025-12-04T09:17:11.8550710Z * [new tag] ciflow/h100/169200 -> ciflow/h100/169200 2025-12-04T09:17:11.8552334Z * [new tag] ciflow/h100/169216 -> ciflow/h100/169216 2025-12-04T09:17:11.8553876Z * [new tag] ciflow/h100/169380 -> ciflow/h100/169380 2025-12-04T09:17:11.8555209Z * [new tag] ciflow/h100/169412 -> ciflow/h100/169412 2025-12-04T09:17:11.8556534Z * [new tag] ciflow/h100/169470 -> ciflow/h100/169470 2025-12-04T09:17:11.8557850Z * [new tag] ciflow/h100/169471 -> ciflow/h100/169471 2025-12-04T09:17:11.8559204Z * [new tag] ciflow/h100/169472 -> ciflow/h100/169472 2025-12-04T09:17:11.8560660Z * [new tag] ciflow/h100/169514 -> ciflow/h100/169514 2025-12-04T09:17:11.8562194Z * [new tag] ciflow/inductor-cu126/168096 -> ciflow/inductor-cu126/168096 2025-12-04T09:17:11.8564303Z * [new tag] ciflow/inductor-micro-benchmark-cpu-x86/168096 -> ciflow/inductor-micro-benchmark-cpu-x86/168096 2025-12-04T09:17:11.8565812Z * [new tag] ciflow/inductor-micro-benchmark/166165 -> ciflow/inductor-micro-benchmark/166165 2025-12-04T09:17:11.8566851Z * [new tag] ciflow/inductor-micro-benchmark/168096 -> ciflow/inductor-micro-benchmark/168096 2025-12-04T09:17:11.8568605Z * [new tag] ciflow/inductor-perf-compare/168096 -> ciflow/inductor-perf-compare/168096 2025-12-04T09:17:11.8570517Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168073 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168073 
2025-12-04T09:17:11.8571513Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168096 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168096 2025-12-04T09:17:11.8573165Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi300/169024 2025-12-04T09:17:11.8574682Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi355/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi355/169024 2025-12-04T09:17:11.8576205Z * [new tag] ciflow/inductor-perf-test-nightly/168096 -> ciflow/inductor-perf-test-nightly/168096 2025-12-04T09:17:11.8577697Z * [new tag] ciflow/inductor-periodic/168096 -> ciflow/inductor-periodic/168096 2025-12-04T09:17:11.8578912Z * [new tag] ciflow/inductor-periodic/169024 -> ciflow/inductor-periodic/169024 2025-12-04T09:17:11.8580869Z * [new tag] ciflow/inductor-periodic/169425 -> ciflow/inductor-periodic/169425 2025-12-04T09:17:11.8582565Z * [new tag] ciflow/inductor-rocm-mi200/165545 -> ciflow/inductor-rocm-mi200/165545 2025-12-04T09:17:11.8583882Z * [new tag] ciflow/inductor-rocm-mi200/165997 -> ciflow/inductor-rocm-mi200/165997 2025-12-04T09:17:11.8584878Z * [new tag] ciflow/inductor-rocm-mi200/168096 -> ciflow/inductor-rocm-mi200/168096 2025-12-04T09:17:11.8586542Z * [new tag] ciflow/inductor-rocm-mi200/169063 -> ciflow/inductor-rocm-mi200/169063 2025-12-04T09:17:11.8587710Z * [new tag] ciflow/inductor-rocm-mi200/169425 -> ciflow/inductor-rocm-mi200/169425 2025-12-04T09:17:11.8589477Z * [new tag] ciflow/inductor-rocm-mi300/165545 -> ciflow/inductor-rocm-mi300/165545 2025-12-04T09:17:11.8590395Z * [new tag] ciflow/inductor-rocm-mi300/168096 -> ciflow/inductor-rocm-mi300/168096 2025-12-04T09:17:11.8591843Z * [new tag] ciflow/inductor-rocm-mi300/169063 -> ciflow/inductor-rocm-mi300/169063 2025-12-04T09:17:11.8592880Z * [new tag] ciflow/inductor-rocm-mi300/169425 -> ciflow/inductor-rocm-mi300/169425 2025-12-04T09:17:11.8594737Z * [new tag] ciflow/inductor-rocm/162052 -> 
ciflow/inductor-rocm/162052 2025-12-04T09:17:11.8595859Z * [new tag] ciflow/inductor-rocm/168971 -> ciflow/inductor-rocm/168971 2025-12-04T09:17:11.8597553Z * [new tag] ciflow/inductor-windows/168096 -> ciflow/inductor-windows/168096 2025-12-04T09:17:11.8599045Z * [new tag] ciflow/inductor/144542 -> ciflow/inductor/144542 2025-12-04T09:17:11.8600539Z * [new tag] ciflow/inductor/146506 -> ciflow/inductor/146506 2025-12-04T09:17:11.8602986Z * [new tag] ciflow/inductor/147990 -> ciflow/inductor/147990 2025-12-04T09:17:11.8604306Z * [new tag] ciflow/inductor/148294 -> ciflow/inductor/148294 2025-12-04T09:17:11.8605554Z * [new tag] ciflow/inductor/148492 -> ciflow/inductor/148492 2025-12-04T09:17:11.8606789Z * [new tag] ciflow/inductor/157149 -> ciflow/inductor/157149 2025-12-04T09:17:11.8608020Z * [new tag] ciflow/inductor/157994 -> ciflow/inductor/157994 2025-12-04T09:17:11.8609295Z * [new tag] ciflow/inductor/160685 -> ciflow/inductor/160685 2025-12-04T09:17:11.8610512Z * [new tag] ciflow/inductor/160686 -> ciflow/inductor/160686 2025-12-04T09:17:11.8611863Z * [new tag] ciflow/inductor/160687 -> ciflow/inductor/160687 2025-12-04T09:17:11.8613118Z * [new tag] ciflow/inductor/160688 -> ciflow/inductor/160688 2025-12-04T09:17:11.8614748Z * [new tag] ciflow/inductor/160706 -> ciflow/inductor/160706 2025-12-04T09:17:11.8616435Z * [new tag] ciflow/inductor/160729 -> ciflow/inductor/160729 2025-12-04T09:17:11.8618018Z * [new tag] ciflow/inductor/161938 -> ciflow/inductor/161938 2025-12-04T09:17:11.8619381Z * [new tag] ciflow/inductor/161939 -> ciflow/inductor/161939 2025-12-04T09:17:11.8620692Z * [new tag] ciflow/inductor/161940 -> ciflow/inductor/161940 2025-12-04T09:17:11.8621963Z * [new tag] ciflow/inductor/162052 -> ciflow/inductor/162052 2025-12-04T09:17:11.8623261Z * [new tag] ciflow/inductor/162275 -> ciflow/inductor/162275 2025-12-04T09:17:11.8624562Z * [new tag] ciflow/inductor/162795 -> ciflow/inductor/162795 2025-12-04T09:17:11.8626101Z * [new tag] 
ciflow/inductor/163245 -> ciflow/inductor/163245 2025-12-04T09:17:11.8627407Z * [new tag] ciflow/inductor/163335 -> ciflow/inductor/163335 2025-12-04T09:17:11.8628726Z * [new tag] ciflow/inductor/163503 -> ciflow/inductor/163503 2025-12-04T09:17:11.8630049Z * [new tag] ciflow/inductor/163942 -> ciflow/inductor/163942 2025-12-04T09:17:11.8631556Z * [new tag] ciflow/inductor/165270 -> ciflow/inductor/165270 2025-12-04T09:17:11.8633316Z * [new tag] ciflow/inductor/165274 -> ciflow/inductor/165274 2025-12-04T09:17:11.8634722Z * [new tag] ciflow/inductor/165322 -> ciflow/inductor/165322 2025-12-04T09:17:11.8635998Z * [new tag] ciflow/inductor/165597 -> ciflow/inductor/165597 2025-12-04T09:17:11.8637353Z * [new tag] ciflow/inductor/166063 -> ciflow/inductor/166063 2025-12-04T09:17:11.8639011Z * [new tag] ciflow/inductor/166075 -> ciflow/inductor/166075 2025-12-04T09:17:11.8640034Z * [new tag] ciflow/inductor/166165 -> ciflow/inductor/166165 2025-12-04T09:17:11.8641884Z * [new tag] ciflow/inductor/166254 -> ciflow/inductor/166254 2025-12-04T09:17:11.8643028Z * [new tag] ciflow/inductor/166483 -> ciflow/inductor/166483 2025-12-04T09:17:11.8644327Z * [new tag] ciflow/inductor/166494 -> ciflow/inductor/166494 2025-12-04T09:17:11.8645657Z * [new tag] ciflow/inductor/166545 -> ciflow/inductor/166545 2025-12-04T09:17:11.8646946Z * [new tag] ciflow/inductor/166788 -> ciflow/inductor/166788 2025-12-04T09:17:11.8648537Z * [new tag] ciflow/inductor/166846 -> ciflow/inductor/166846 2025-12-04T09:17:11.8649814Z * [new tag] ciflow/inductor/167300 -> ciflow/inductor/167300 2025-12-04T09:17:11.8651114Z * [new tag] ciflow/inductor/167407 -> ciflow/inductor/167407 2025-12-04T09:17:11.8652525Z * [new tag] ciflow/inductor/167536 -> ciflow/inductor/167536 2025-12-04T09:17:11.8653881Z * [new tag] ciflow/inductor/167552 -> ciflow/inductor/167552 2025-12-04T09:17:11.8655160Z * [new tag] ciflow/inductor/167555 -> ciflow/inductor/167555 2025-12-04T09:17:11.8656718Z * [new tag] 
ciflow/inductor/167583 -> ciflow/inductor/167583 2025-12-04T09:17:11.8658044Z * [new tag] ciflow/inductor/167599 -> ciflow/inductor/167599 2025-12-04T09:17:11.8659434Z * [new tag] ciflow/inductor/167647 -> ciflow/inductor/167647 2025-12-04T09:17:11.8660791Z * [new tag] ciflow/inductor/167677 -> ciflow/inductor/167677 2025-12-04T09:17:11.8662101Z * [new tag] ciflow/inductor/167680 -> ciflow/inductor/167680 2025-12-04T09:17:11.8663377Z * [new tag] ciflow/inductor/167695 -> ciflow/inductor/167695 2025-12-04T09:17:11.8664713Z * [new tag] ciflow/inductor/167742 -> ciflow/inductor/167742 2025-12-04T09:17:11.8666022Z * [new tag] ciflow/inductor/167768 -> ciflow/inductor/167768 2025-12-04T09:17:11.8667566Z * [new tag] ciflow/inductor/167773 -> ciflow/inductor/167773 2025-12-04T09:17:11.8669000Z * [new tag] ciflow/inductor/167781 -> ciflow/inductor/167781 2025-12-04T09:17:11.8670592Z * [new tag] ciflow/inductor/167880 -> ciflow/inductor/167880 2025-12-04T09:17:11.8671699Z * [new tag] ciflow/inductor/167887 -> ciflow/inductor/167887 2025-12-04T09:17:11.8673055Z * [new tag] ciflow/inductor/167972 -> ciflow/inductor/167972 2025-12-04T09:17:11.8674371Z * [new tag] ciflow/inductor/167989 -> ciflow/inductor/167989 2025-12-04T09:17:11.8675665Z * [new tag] ciflow/inductor/168002 -> ciflow/inductor/168002 2025-12-04T09:17:11.8677015Z * [new tag] ciflow/inductor/168050 -> ciflow/inductor/168050 2025-12-04T09:17:11.8678365Z * [new tag] ciflow/inductor/168051 -> ciflow/inductor/168051 2025-12-04T09:17:11.8679731Z * [new tag] ciflow/inductor/168052 -> ciflow/inductor/168052 2025-12-04T09:17:11.8681149Z * [new tag] ciflow/inductor/168073 -> ciflow/inductor/168073 2025-12-04T09:17:11.8682463Z * [new tag] ciflow/inductor/168096 -> ciflow/inductor/168096 2025-12-04T09:17:11.8683786Z * [new tag] ciflow/inductor/168114 -> ciflow/inductor/168114 2025-12-04T09:17:11.8685103Z * [new tag] ciflow/inductor/168115 -> ciflow/inductor/168115 2025-12-04T09:17:11.8686422Z * [new tag] 
ciflow/inductor/168127 -> ciflow/inductor/168127 2025-12-04T09:17:11.8687708Z * [new tag] ciflow/inductor/168129 -> ciflow/inductor/168129 2025-12-04T09:17:11.8689026Z * [new tag] ciflow/inductor/168157 -> ciflow/inductor/168157 2025-12-04T09:17:11.8690445Z * [new tag] ciflow/inductor/168175 -> ciflow/inductor/168175 2025-12-04T09:17:11.8691699Z * [new tag] ciflow/inductor/168185 -> ciflow/inductor/168185 2025-12-04T09:17:11.8693027Z * [new tag] ciflow/inductor/168195 -> ciflow/inductor/168195 2025-12-04T09:17:11.8694235Z * [new tag] ciflow/inductor/168209 -> ciflow/inductor/168209 2025-12-04T09:17:11.8695658Z * [new tag] ciflow/inductor/168266 -> ciflow/inductor/168266 2025-12-04T09:17:11.8697006Z * [new tag] ciflow/inductor/168316 -> ciflow/inductor/168316 2025-12-04T09:17:11.8698502Z * [new tag] ciflow/inductor/168326 -> ciflow/inductor/168326 2025-12-04T09:17:11.8699889Z * [new tag] ciflow/inductor/168368 -> ciflow/inductor/168368 2025-12-04T09:17:11.8701431Z * [new tag] ciflow/inductor/168894 -> ciflow/inductor/168894 2025-12-04T09:17:11.8702769Z * [new tag] ciflow/inductor/168934 -> ciflow/inductor/168934 2025-12-04T09:17:11.8704051Z * [new tag] ciflow/inductor/168939 -> ciflow/inductor/168939 2025-12-04T09:17:11.8705367Z * [new tag] ciflow/inductor/168946 -> ciflow/inductor/168946 2025-12-04T09:17:11.8706682Z * [new tag] ciflow/inductor/168950 -> ciflow/inductor/168950 2025-12-04T09:17:11.8708010Z * [new tag] ciflow/inductor/168951 -> ciflow/inductor/168951 2025-12-04T09:17:11.8709367Z * [new tag] ciflow/inductor/168952 -> ciflow/inductor/168952 2025-12-04T09:17:11.8710682Z * [new tag] ciflow/inductor/168955 -> ciflow/inductor/168955 2025-12-04T09:17:11.8712033Z * [new tag] ciflow/inductor/168971 -> ciflow/inductor/168971 2025-12-04T09:17:11.8713332Z * [new tag] ciflow/inductor/168979 -> ciflow/inductor/168979 2025-12-04T09:17:11.8714671Z * [new tag] ciflow/inductor/168980 -> ciflow/inductor/168980 2025-12-04T09:17:11.8716207Z * [new tag] 
ciflow/inductor/168983 -> ciflow/inductor/168983 2025-12-04T09:17:11.8717563Z * [new tag] ciflow/inductor/169006 -> ciflow/inductor/169006 2025-12-04T09:17:11.8718979Z * [new tag] ciflow/inductor/169023 -> ciflow/inductor/169023 2025-12-04T09:17:11.8721130Z * [new tag] ciflow/inductor/169024 -> ciflow/inductor/169024 2025-12-04T09:17:11.8722440Z * [new tag] ciflow/inductor/169025 -> ciflow/inductor/169025 2025-12-04T09:17:11.8723933Z * [new tag] ciflow/inductor/169066 -> ciflow/inductor/169066 2025-12-04T09:17:11.8725052Z * [new tag] ciflow/inductor/169091 -> ciflow/inductor/169091 2025-12-04T09:17:11.8726353Z * [new tag] ciflow/inductor/169102 -> ciflow/inductor/169102 2025-12-04T09:17:11.8727855Z * [new tag] ciflow/inductor/169103 -> ciflow/inductor/169103 2025-12-04T09:17:11.8729122Z * [new tag] ciflow/inductor/169121 -> ciflow/inductor/169121 2025-12-04T09:17:11.8730354Z * [new tag] ciflow/inductor/169134 -> ciflow/inductor/169134 2025-12-04T09:17:11.8731684Z * [new tag] ciflow/inductor/169135 -> ciflow/inductor/169135 2025-12-04T09:17:11.8732896Z * [new tag] ciflow/inductor/169141 -> ciflow/inductor/169141 2025-12-04T09:17:11.8734715Z * [new tag] ciflow/inductor/169151 -> ciflow/inductor/169151 2025-12-04T09:17:11.8735839Z * [new tag] ciflow/inductor/169161 -> ciflow/inductor/169161 2025-12-04T09:17:11.8737055Z * [new tag] ciflow/inductor/169167 -> ciflow/inductor/169167 2025-12-04T09:17:11.8738890Z * [new tag] ciflow/inductor/169177 -> ciflow/inductor/169177 2025-12-04T09:17:11.8740256Z * [new tag] ciflow/inductor/169185 -> ciflow/inductor/169185 2025-12-04T09:17:11.8741447Z * [new tag] ciflow/inductor/169196 -> ciflow/inductor/169196 2025-12-04T09:17:11.8742750Z * [new tag] ciflow/inductor/169200 -> ciflow/inductor/169200 2025-12-04T09:17:11.8744191Z * [new tag] ciflow/inductor/169204 -> ciflow/inductor/169204 2025-12-04T09:17:11.8745466Z * [new tag] ciflow/inductor/169216 -> ciflow/inductor/169216 2025-12-04T09:17:11.8746988Z * [new tag] 
ciflow/inductor/169219 -> ciflow/inductor/169219 2025-12-04T09:17:11.8748088Z * [new tag] ciflow/inductor/169220 -> ciflow/inductor/169220 2025-12-04T09:17:11.8749778Z * [new tag] ciflow/inductor/169230 -> ciflow/inductor/169230 2025-12-04T09:17:11.8751004Z * [new tag] ciflow/inductor/169242 -> ciflow/inductor/169242 2025-12-04T09:17:11.8752750Z * [new tag] ciflow/inductor/169245 -> ciflow/inductor/169245 2025-12-04T09:17:11.8753844Z * [new tag] ciflow/inductor/169260 -> ciflow/inductor/169260 2025-12-04T09:17:11.8755300Z * [new tag] ciflow/inductor/169282 -> ciflow/inductor/169282 2025-12-04T09:17:11.8756419Z * [new tag] ciflow/inductor/169286 -> ciflow/inductor/169286 2025-12-04T09:17:11.8757936Z * [new tag] ciflow/inductor/169299 -> ciflow/inductor/169299 2025-12-04T09:17:11.8759489Z * [new tag] ciflow/inductor/169304 -> ciflow/inductor/169304 2025-12-04T09:17:11.8761338Z * [new tag] ciflow/inductor/169305 -> ciflow/inductor/169305 2025-12-04T09:17:11.8762445Z * [new tag] ciflow/inductor/169308 -> ciflow/inductor/169308 2025-12-04T09:17:11.8764015Z * [new tag] ciflow/inductor/169319 -> ciflow/inductor/169319 2025-12-04T09:17:11.8765132Z * [new tag] ciflow/inductor/169326 -> ciflow/inductor/169326 2025-12-04T09:17:11.8766666Z * [new tag] ciflow/inductor/169332 -> ciflow/inductor/169332 2025-12-04T09:17:11.8767783Z * [new tag] ciflow/inductor/169333 -> ciflow/inductor/169333 2025-12-04T09:17:11.8769548Z * [new tag] ciflow/inductor/169336 -> ciflow/inductor/169336 2025-12-04T09:17:11.8770757Z * [new tag] ciflow/inductor/169340 -> ciflow/inductor/169340 2025-12-04T09:17:11.8772300Z * [new tag] ciflow/inductor/169341 -> ciflow/inductor/169341 2025-12-04T09:17:11.8773478Z * [new tag] ciflow/inductor/169343 -> ciflow/inductor/169343 2025-12-04T09:17:11.8774985Z * [new tag] ciflow/inductor/169346 -> ciflow/inductor/169346 2025-12-04T09:17:11.8776444Z * [new tag] ciflow/inductor/169348 -> ciflow/inductor/169348 2025-12-04T09:17:11.8777894Z * [new tag] 
ciflow/inductor/169350 -> ciflow/inductor/169350 2025-12-04T09:17:11.8779069Z * [new tag] ciflow/inductor/169355 -> ciflow/inductor/169355 2025-12-04T09:17:11.8780590Z * [new tag] ciflow/inductor/169370 -> ciflow/inductor/169370 2025-12-04T09:17:11.8782251Z * [new tag] ciflow/inductor/169375 -> ciflow/inductor/169375 2025-12-04T09:17:11.8783382Z * [new tag] ciflow/inductor/169389 -> ciflow/inductor/169389 2025-12-04T09:17:11.8784891Z * [new tag] ciflow/inductor/169391 -> ciflow/inductor/169391 2025-12-04T09:17:11.8786064Z * [new tag] ciflow/inductor/169393 -> ciflow/inductor/169393 2025-12-04T09:17:11.8787539Z * [new tag] ciflow/inductor/169399 -> ciflow/inductor/169399 2025-12-04T09:17:11.8789175Z * [new tag] ciflow/inductor/169400 -> ciflow/inductor/169400 2025-12-04T09:17:11.8790328Z * [new tag] ciflow/inductor/169415 -> ciflow/inductor/169415 2025-12-04T09:17:11.8792243Z * [new tag] ciflow/inductor/169417 -> ciflow/inductor/169417 2025-12-04T09:17:11.8793052Z * [new tag] ciflow/inductor/169418 -> ciflow/inductor/169418 2025-12-04T09:17:11.8794812Z * [new tag] ciflow/inductor/169430 -> ciflow/inductor/169430 2025-12-04T09:17:11.8795954Z * [new tag] ciflow/inductor/169432 -> ciflow/inductor/169432 2025-12-04T09:17:11.8797456Z * [new tag] ciflow/inductor/169436 -> ciflow/inductor/169436 2025-12-04T09:17:11.8798826Z * [new tag] ciflow/inductor/169437 -> ciflow/inductor/169437 2025-12-04T09:17:11.8800640Z * [new tag] ciflow/inductor/169438 -> ciflow/inductor/169438 2025-12-04T09:17:11.8803508Z * [new tag] ciflow/inductor/169441 -> ciflow/inductor/169441 2025-12-04T09:17:11.8804598Z * [new tag] ciflow/inductor/169446 -> ciflow/inductor/169446 2025-12-04T09:17:11.8806254Z * [new tag] ciflow/inductor/169447 -> ciflow/inductor/169447 2025-12-04T09:17:11.8807434Z * [new tag] ciflow/inductor/169452 -> ciflow/inductor/169452 2025-12-04T09:17:11.8809282Z * [new tag] ciflow/inductor/169455 -> ciflow/inductor/169455 2025-12-04T09:17:11.8810424Z * [new tag] 
ciflow/inductor/169459 -> ciflow/inductor/169459 2025-12-04T09:17:11.8812460Z * [new tag] ciflow/inductor/169463 -> ciflow/inductor/169463 2025-12-04T09:17:11.8814014Z * [new tag] ciflow/inductor/169476 -> ciflow/inductor/169476 2025-12-04T09:17:11.8815157Z * [new tag] ciflow/inductor/169485 -> ciflow/inductor/169485 2025-12-04T09:17:11.8816664Z * [new tag] ciflow/inductor/169493 -> ciflow/inductor/169493 2025-12-04T09:17:11.8817854Z * [new tag] ciflow/inductor/169496 -> ciflow/inductor/169496 2025-12-04T09:17:11.8819473Z * [new tag] ciflow/inductor/169497 -> ciflow/inductor/169497 2025-12-04T09:17:11.8820608Z * [new tag] ciflow/inductor/169503 -> ciflow/inductor/169503 2025-12-04T09:17:11.8822126Z * [new tag] ciflow/inductor/169504 -> ciflow/inductor/169504 2025-12-04T09:17:11.8823664Z * [new tag] ciflow/inductor/169505 -> ciflow/inductor/169505 2025-12-04T09:17:11.8825413Z * [new tag] ciflow/inductor/169508 -> ciflow/inductor/169508 2025-12-04T09:17:11.8826550Z * [new tag] ciflow/inductor/169509 -> ciflow/inductor/169509 2025-12-04T09:17:11.8828137Z * [new tag] ciflow/inductor/169513 -> ciflow/inductor/169513 2025-12-04T09:17:11.8829349Z * [new tag] ciflow/inductor/169514 -> ciflow/inductor/169514 2025-12-04T09:17:11.8830895Z * [new tag] ciflow/inductor/169515 -> ciflow/inductor/169515 2025-12-04T09:17:11.8832080Z * [new tag] ciflow/inductor/169517 -> ciflow/inductor/169517 2025-12-04T09:17:11.8833653Z * [new tag] ciflow/inductor/169519 -> ciflow/inductor/169519 2025-12-04T09:17:11.8834809Z * [new tag] ciflow/inductor/169520 -> ciflow/inductor/169520 2025-12-04T09:17:11.8836438Z * [new tag] ciflow/inductor/169521 -> ciflow/inductor/169521 2025-12-04T09:17:11.8837605Z * [new tag] ciflow/inductor/169524 -> ciflow/inductor/169524 2025-12-04T09:17:11.8839385Z * [new tag] ciflow/inductor/169527 -> ciflow/inductor/169527 2025-12-04T09:17:11.8840391Z * [new tag] ciflow/inductor/169528 -> ciflow/inductor/169528 2025-12-04T09:17:11.8842026Z * [new tag] 
ciflow/inductor/169532 -> ciflow/inductor/169532 2025-12-04T09:17:11.8843165Z * [new tag] ciflow/inductor/169535 -> ciflow/inductor/169535 2025-12-04T09:17:11.8844680Z * [new tag] ciflow/inductor/169536 -> ciflow/inductor/169536 2025-12-04T09:17:11.8846003Z * [new tag] ciflow/inductor/169547 -> ciflow/inductor/169547 2025-12-04T09:17:11.8847214Z * [new tag] ciflow/inductor/169548 -> ciflow/inductor/169548 2025-12-04T09:17:11.8848750Z * [new tag] ciflow/inductor/169549 -> ciflow/inductor/169549 2025-12-04T09:17:11.8849913Z * [new tag] ciflow/inductor/169551 -> ciflow/inductor/169551 2025-12-04T09:17:11.8851453Z * [new tag] ciflow/inductor/169552 -> ciflow/inductor/169552 2025-12-04T09:17:11.8852650Z * [new tag] ciflow/inductor/169553 -> ciflow/inductor/169553 2025-12-04T09:17:11.8854168Z * [new tag] ciflow/inductor/169557 -> ciflow/inductor/169557 2025-12-04T09:17:11.8855778Z * [new tag] ciflow/inductor/3b9a386 -> ciflow/inductor/3b9a386 2025-12-04T09:17:11.8857552Z * [new tag] ciflow/inductor/3d4b92b -> ciflow/inductor/3d4b92b 2025-12-04T09:17:11.8858637Z * [new tag] ciflow/inductor/d224ac7 -> ciflow/inductor/d224ac7 2025-12-04T09:17:11.8860358Z * [new tag] ciflow/linux-aarch64/157994 -> ciflow/linux-aarch64/157994 2025-12-04T09:17:11.8861407Z * [new tag] ciflow/linux-aarch64/166075 -> ciflow/linux-aarch64/166075 2025-12-04T09:17:11.8862619Z * [new tag] ciflow/linux-aarch64/166876 -> ciflow/linux-aarch64/166876 2025-12-04T09:17:11.8864100Z * [new tag] ciflow/linux-aarch64/167981 -> ciflow/linux-aarch64/167981 2025-12-04T09:17:11.8865569Z * [new tag] ciflow/mps/166254 -> ciflow/mps/166254 2025-12-04T09:17:11.8866631Z * [new tag] ciflow/mps/169017 -> ciflow/mps/169017 2025-12-04T09:17:11.8868217Z * [new tag] ciflow/mps/169372 -> ciflow/mps/169372 2025-12-04T09:17:11.8869229Z * [new tag] ciflow/mps/169478 -> ciflow/mps/169478 2025-12-04T09:17:11.8870988Z * [new tag] ciflow/op-benchmark/157994 -> ciflow/op-benchmark/157994 2025-12-04T09:17:11.8872027Z * [new tag] 
ciflow/op-benchmark/166075 -> ciflow/op-benchmark/166075 2025-12-04T09:17:11.8873492Z * [new tag] ciflow/op-benchmark/169544 -> ciflow/op-benchmark/169544 2025-12-04T09:17:11.8875196Z * [new tag] ciflow/periodic-rocm-mi200/165997 -> ciflow/periodic-rocm-mi200/165997 2025-12-04T09:17:11.8876366Z * [new tag] ciflow/periodic-rocm-mi200/166517 -> ciflow/periodic-rocm-mi200/166517 2025-12-04T09:17:11.8877567Z * [new tag] ciflow/periodic-rocm-mi200/169063 -> ciflow/periodic-rocm-mi200/169063 2025-12-04T09:17:11.8878932Z * [new tag] ciflow/periodic-rocm-mi200/169425 -> ciflow/periodic-rocm-mi200/169425 2025-12-04T09:17:11.8880808Z * [new tag] ciflow/periodic-rocm-mi300/166517 -> ciflow/periodic-rocm-mi300/166517 2025-12-04T09:17:11.8881855Z * [new tag] ciflow/periodic-rocm-mi300/169063 -> ciflow/periodic-rocm-mi300/169063 2025-12-04T09:17:11.8883069Z * [new tag] ciflow/periodic-rocm-mi300/169425 -> ciflow/periodic-rocm-mi300/169425 2025-12-04T09:17:11.8885046Z * [new tag] ciflow/periodic/054a2fd -> ciflow/periodic/054a2fd 2025-12-04T09:17:11.8886081Z * [new tag] ciflow/periodic/167207 -> ciflow/periodic/167207 2025-12-04T09:17:11.8887630Z * [new tag] ciflow/periodic/167978 -> ciflow/periodic/167978 2025-12-04T09:17:11.8888741Z * [new tag] ciflow/periodic/168096 -> ciflow/periodic/168096 2025-12-04T09:17:11.8889951Z * [new tag] ciflow/periodic/169286 -> ciflow/periodic/169286 2025-12-04T09:17:11.8891729Z * [new tag] ciflow/periodic/2a6d37d -> ciflow/periodic/2a6d37d 2025-12-04T09:17:11.8892882Z * [new tag] ciflow/periodic/317eeb8 -> ciflow/periodic/317eeb8 2025-12-04T09:17:11.8894549Z * [new tag] ciflow/periodic/3c32 -> ciflow/periodic/3c32 2025-12-04T09:17:11.8895723Z * [new tag] ciflow/periodic/3e98831 -> ciflow/periodic/3e98831 2025-12-04T09:17:11.8897987Z * [new tag] ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T09:17:11.8899362Z * [new tag] ciflow/periodic/94512-point -> 
ciflow/periodic/94512-point 2025-12-04T09:17:11.8901638Z * [new tag] ciflow/periodic/csl/test87519 -> ciflow/periodic/csl/test87519 2025-12-04T09:17:11.8902905Z * [new tag] ciflow/periodic/csltest88275 -> ciflow/periodic/csltest88275 2025-12-04T09:17:11.8904529Z * [new tag] ciflow/periodic/csltest88761 -> ciflow/periodic/csltest88761 2025-12-04T09:17:11.8906101Z * [new tag] ciflow/periodic/release_1.12 -> ciflow/periodic/release_1.12 2025-12-04T09:17:11.8907748Z * [new tag] ciflow/periodic/release_1.12.0 -> ciflow/periodic/release_1.12.0 2025-12-04T09:17:11.8909414Z * [new tag] ciflow/periodic/sha-ec5b83 -> ciflow/periodic/sha-ec5b83 2025-12-04T09:17:11.8910921Z * [new tag] ciflow/pull/167207 -> ciflow/pull/167207 2025-12-04T09:17:11.8912726Z * [new tag] ciflow/quantization-periodic/169207 -> ciflow/quantization-periodic/169207 2025-12-04T09:17:11.8914280Z * [new tag] ciflow/rocm-mi200/165545 -> ciflow/rocm-mi200/165545 2025-12-04T09:17:11.8915337Z * [new tag] ciflow/rocm-mi200/165997 -> ciflow/rocm-mi200/165997 2025-12-04T09:17:11.8916857Z * [new tag] ciflow/rocm-mi200/168096 -> ciflow/rocm-mi200/168096 2025-12-04T09:17:11.8918066Z * [new tag] ciflow/rocm-mi200/168275 -> ciflow/rocm-mi200/168275 2025-12-04T09:17:11.8919315Z * [new tag] ciflow/rocm-mi200/169063 -> ciflow/rocm-mi200/169063 2025-12-04T09:17:11.8921030Z * [new tag] ciflow/rocm-mi200/169356 -> ciflow/rocm-mi200/169356 2025-12-04T09:17:11.8922072Z * [new tag] ciflow/rocm-mi200/169425 -> ciflow/rocm-mi200/169425 2025-12-04T09:17:11.8923925Z * [new tag] ciflow/rocm-mi300/165545 -> ciflow/rocm-mi300/165545 2025-12-04T09:17:11.8925114Z * [new tag] ciflow/rocm-mi300/167157 -> ciflow/rocm-mi300/167157 2025-12-04T09:17:11.8927042Z * [new tag] ciflow/rocm-mi300/168096 -> ciflow/rocm-mi300/168096 2025-12-04T09:17:11.8928101Z * [new tag] ciflow/rocm-mi300/169063 -> ciflow/rocm-mi300/169063 2025-12-04T09:17:11.8929314Z * [new tag] ciflow/rocm-mi300/169425 -> ciflow/rocm-mi300/169425 2025-12-04T09:17:11.8931026Z * 
[new tag] ciflow/rocm-mi355/167157 -> ciflow/rocm-mi355/167157 2025-12-04T09:17:11.8932217Z * [new tag] ciflow/rocm-mi355/168275 -> ciflow/rocm-mi355/168275 2025-12-04T09:17:11.8933424Z * [new tag] ciflow/rocm-mi355/169425 -> ciflow/rocm-mi355/169425 2025-12-04T09:17:11.8935351Z * [new tag] ciflow/rocm-navi31/168275 -> ciflow/rocm-navi31/168275 2025-12-04T09:17:11.8936263Z * [new tag] ciflow/rocm-navi31/169425 -> ciflow/rocm-navi31/169425 2025-12-04T09:17:11.8938061Z * [new tag] ciflow/rocm/115316 -> ciflow/rocm/115316 2025-12-04T09:17:11.8939003Z * [new tag] ciflow/rocm/148492 -> ciflow/rocm/148492 2025-12-04T09:17:11.8940388Z * [new tag] ciflow/rocm/160685 -> ciflow/rocm/160685 2025-12-04T09:17:11.8941471Z * [new tag] ciflow/rocm/161607 -> ciflow/rocm/161607 2025-12-04T09:17:11.8942866Z * [new tag] ciflow/rocm/162052 -> ciflow/rocm/162052 2025-12-04T09:17:11.8943898Z * [new tag] ciflow/rocm/165997 -> ciflow/rocm/165997 2025-12-04T09:17:11.8945479Z * [new tag] ciflow/rocm/166165 -> ciflow/rocm/166165 2025-12-04T09:17:11.8946373Z * [new tag] ciflow/rocm/166517 -> ciflow/rocm/166517 2025-12-04T09:17:11.8947803Z * [new tag] ciflow/rocm/167207 -> ciflow/rocm/167207 2025-12-04T09:17:11.8948877Z * [new tag] ciflow/rocm/167536 -> ciflow/rocm/167536 2025-12-04T09:17:11.8950146Z * [new tag] ciflow/rocm/167781 -> ciflow/rocm/167781 2025-12-04T09:17:11.8952040Z * [new tag] ciflow/rocm/167989 -> ciflow/rocm/167989 2025-12-04T09:17:11.8953668Z * [new tag] ciflow/rocm/168073 -> ciflow/rocm/168073 2025-12-04T09:17:11.8955243Z * [new tag] ciflow/rocm/168195 -> ciflow/rocm/168195 2025-12-04T09:17:11.8956355Z * [new tag] ciflow/rocm/168939 -> ciflow/rocm/168939 2025-12-04T09:17:11.8957868Z * [new tag] ciflow/rocm/168971 -> ciflow/rocm/168971 2025-12-04T09:17:11.8959020Z * [new tag] ciflow/rocm/169024 -> ciflow/rocm/169024 2025-12-04T09:17:11.8960694Z * [new tag] ciflow/rocm/169200 -> ciflow/rocm/169200 2025-12-04T09:17:11.8961786Z * [new tag] ciflow/rocm/169216 -> 
ciflow/rocm/169216 2025-12-04T09:17:11.8963261Z * [new tag] ciflow/rocm/169312 -> ciflow/rocm/169312 2025-12-04T09:17:11.8964409Z * [new tag] ciflow/rocm/169380 -> ciflow/rocm/169380 2025-12-04T09:17:11.8965855Z * [new tag] ciflow/rocm/169427 -> ciflow/rocm/169427 2025-12-04T09:17:11.8966972Z * [new tag] ciflow/rocm/169455 -> ciflow/rocm/169455 2025-12-04T09:17:11.8968517Z * [new tag] ciflow/rocm/169470 -> ciflow/rocm/169470 2025-12-04T09:17:11.8969675Z * [new tag] ciflow/rocm/169471 -> ciflow/rocm/169471 2025-12-04T09:17:11.8971188Z * [new tag] ciflow/rocm/169472 -> ciflow/rocm/169472 2025-12-04T09:17:11.8972333Z * [new tag] ciflow/rocm/169514 -> ciflow/rocm/169514 2025-12-04T09:17:11.8974275Z * [new tag] ciflow/slow/01c7106 -> ciflow/slow/01c7106 2025-12-04T09:17:11.8975474Z * [new tag] ciflow/slow/0577043 -> ciflow/slow/0577043 2025-12-04T09:17:11.8977500Z * [new tag] ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym -> ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym 2025-12-04T09:17:11.8978445Z * [new tag] ciflow/slow/0e81104 -> ciflow/slow/0e81104 2025-12-04T09:17:11.8979887Z * [new tag] ciflow/slow/167207 -> ciflow/slow/167207 2025-12-04T09:17:11.8980916Z * [new tag] ciflow/slow/168050 -> ciflow/slow/168050 2025-12-04T09:17:11.8982507Z * [new tag] ciflow/slow/1732077 -> ciflow/slow/1732077 2025-12-04T09:17:11.8984070Z * [new tag] ciflow/slow/187eb7c -> ciflow/slow/187eb7c 2025-12-04T09:17:11.8985799Z * [new tag] ciflow/slow/1faef89 -> ciflow/slow/1faef89 2025-12-04T09:17:11.8987580Z * [new tag] ciflow/slow/3920ec1 -> ciflow/slow/3920ec1 2025-12-04T09:17:11.8989300Z * [new tag] ciflow/slow/3b7c6b2 -> ciflow/slow/3b7c6b2 2025-12-04T09:17:11.8990774Z * [new tag] ciflow/slow/59a3759 -> ciflow/slow/59a3759 2025-12-04T09:17:11.8992247Z * [new tag] ciflow/slow/70ef0bb -> ciflow/slow/70ef0bb 2025-12-04T09:17:11.8993708Z * [new tag] ciflow/slow/788ff06 -> ciflow/slow/788ff06 2025-12-04T09:17:11.8995668Z * [new tag] 
ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym -> ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym 2025-12-04T09:17:11.9003101Z * [new tag] ciflow/slow/9d85864 -> ciflow/slow/9d85864 2025-12-04T09:17:11.9003763Z * [new tag] ciflow/slow/9ffad5b -> ciflow/slow/9ffad5b 2025-12-04T09:17:11.9004210Z * [new tag] ciflow/slow/a206e8b -> ciflow/slow/a206e8b 2025-12-04T09:17:11.9004641Z * [new tag] ciflow/slow/a837609 -> ciflow/slow/a837609 2025-12-04T09:17:11.9005204Z * [new tag] ciflow/slow/af841f3 -> ciflow/slow/af841f3 2025-12-04T09:17:11.9007489Z * [new tag] ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym -> ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym 2025-12-04T09:17:11.9008616Z * [new tag] ciflow/torchbench/168175 -> ciflow/torchbench/168175 2025-12-04T09:17:11.9010320Z * [new tag] ciflow/trunk/148492 -> ciflow/trunk/148492 2025-12-04T09:17:11.9011423Z * [new tag] ciflow/trunk/157149 -> ciflow/trunk/157149 2025-12-04T09:17:11.9012887Z * [new tag] ciflow/trunk/157994 -> ciflow/trunk/157994 2025-12-04T09:17:11.9013933Z * [new tag] ciflow/trunk/159718 -> ciflow/trunk/159718 2025-12-04T09:17:11.9015354Z * [new tag] ciflow/trunk/160685 -> ciflow/trunk/160685 2025-12-04T09:17:11.9016419Z * [new tag] ciflow/trunk/160729 -> ciflow/trunk/160729 2025-12-04T09:17:11.9017886Z * [new tag] ciflow/trunk/162275 -> ciflow/trunk/162275 2025-12-04T09:17:11.9018922Z * [new tag] ciflow/trunk/162795 -> ciflow/trunk/162795 2025-12-04T09:17:11.9020391Z * [new tag] ciflow/trunk/163245 -> ciflow/trunk/163245 2025-12-04T09:17:11.9021455Z * [new tag] ciflow/trunk/163942 -> ciflow/trunk/163942 2025-12-04T09:17:11.9023181Z * [new tag] ciflow/trunk/165274 -> ciflow/trunk/165274 2025-12-04T09:17:11.9024806Z * [new tag] ciflow/trunk/165483 -> ciflow/trunk/165483 2025-12-04T09:17:11.9026617Z * [new tag] ciflow/trunk/165728 -> ciflow/trunk/165728 2025-12-04T09:17:11.9028247Z * [new tag] ciflow/trunk/165922 -> ciflow/trunk/165922 2025-12-04T09:17:11.9029442Z * 
[new tag] ciflow/trunk/166075 -> ciflow/trunk/166075 2025-12-04T09:17:11.9030955Z * [new tag] ciflow/trunk/166165 -> ciflow/trunk/166165 2025-12-04T09:17:11.9032332Z * [new tag] ciflow/trunk/166829 -> ciflow/trunk/166829 2025-12-04T09:17:11.9033787Z * [new tag] ciflow/trunk/166843 -> ciflow/trunk/166843 2025-12-04T09:17:11.9035160Z * [new tag] ciflow/trunk/166876 -> ciflow/trunk/166876 2025-12-04T09:17:11.9036320Z * [new tag] ciflow/trunk/167207 -> ciflow/trunk/167207 2025-12-04T09:17:11.9038039Z * [new tag] ciflow/trunk/167536 -> ciflow/trunk/167536 2025-12-04T09:17:11.9039059Z * [new tag] ciflow/trunk/167552 -> ciflow/trunk/167552 2025-12-04T09:17:11.9040688Z * [new tag] ciflow/trunk/167555 -> ciflow/trunk/167555 2025-12-04T09:17:11.9042203Z * [new tag] ciflow/trunk/167599 -> ciflow/trunk/167599 2025-12-04T09:17:11.9043398Z * [new tag] ciflow/trunk/167659 -> ciflow/trunk/167659 2025-12-04T09:17:11.9045002Z * [new tag] ciflow/trunk/167672 -> ciflow/trunk/167672 2025-12-04T09:17:11.9046159Z * [new tag] ciflow/trunk/167742 -> ciflow/trunk/167742 2025-12-04T09:17:11.9047681Z * [new tag] ciflow/trunk/167781 -> ciflow/trunk/167781 2025-12-04T09:17:11.9049197Z * [new tag] ciflow/trunk/167837 -> ciflow/trunk/167837 2025-12-04T09:17:11.9050360Z * [new tag] ciflow/trunk/167887 -> ciflow/trunk/167887 2025-12-04T09:17:11.9051923Z * [new tag] ciflow/trunk/167978 -> ciflow/trunk/167978 2025-12-04T09:17:11.9053162Z * [new tag] ciflow/trunk/168050 -> ciflow/trunk/168050 2025-12-04T09:17:11.9054581Z * [new tag] ciflow/trunk/168051 -> ciflow/trunk/168051 2025-12-04T09:17:11.9055713Z * [new tag] ciflow/trunk/168096 -> ciflow/trunk/168096 2025-12-04T09:17:11.9057246Z * [new tag] ciflow/trunk/168127 -> ciflow/trunk/168127 2025-12-04T09:17:11.9058440Z * [new tag] ciflow/trunk/168157 -> ciflow/trunk/168157 2025-12-04T09:17:11.9059908Z * [new tag] ciflow/trunk/168175 -> ciflow/trunk/168175 2025-12-04T09:17:11.9061055Z * [new tag] ciflow/trunk/168209 -> ciflow/trunk/168209 
2025-12-04T09:17:11.9062737Z * [new tag] ciflow/trunk/168213 -> ciflow/trunk/168213 2025-12-04T09:17:11.9064205Z * [new tag] ciflow/trunk/168226 -> ciflow/trunk/168226 2025-12-04T09:17:11.9065371Z * [new tag] ciflow/trunk/168262 -> ciflow/trunk/168262 2025-12-04T09:17:11.9066872Z * [new tag] ciflow/trunk/168275 -> ciflow/trunk/168275 2025-12-04T09:17:11.9068371Z * [new tag] ciflow/trunk/168328 -> ciflow/trunk/168328 2025-12-04T09:17:11.9069531Z * [new tag] ciflow/trunk/168368 -> ciflow/trunk/168368 2025-12-04T09:17:11.9071096Z * [new tag] ciflow/trunk/168917 -> ciflow/trunk/168917 2025-12-04T09:17:11.9072241Z * [new tag] ciflow/trunk/168933 -> ciflow/trunk/168933 2025-12-04T09:17:11.9073894Z * [new tag] ciflow/trunk/168941 -> ciflow/trunk/168941 2025-12-04T09:17:11.9075041Z * [new tag] ciflow/trunk/168955 -> ciflow/trunk/168955 2025-12-04T09:17:11.9076575Z * [new tag] ciflow/trunk/168980 -> ciflow/trunk/168980 2025-12-04T09:17:11.9078103Z * [new tag] ciflow/trunk/169004 -> ciflow/trunk/169004 2025-12-04T09:17:11.9079225Z * [new tag] ciflow/trunk/169006 -> ciflow/trunk/169006 2025-12-04T09:17:11.9080896Z * [new tag] ciflow/trunk/169023 -> ciflow/trunk/169023 2025-12-04T09:17:11.9082034Z * [new tag] ciflow/trunk/169025 -> ciflow/trunk/169025 2025-12-04T09:17:11.9083625Z * [new tag] ciflow/trunk/169048 -> ciflow/trunk/169048 2025-12-04T09:17:11.9084812Z * [new tag] ciflow/trunk/169066 -> ciflow/trunk/169066 2025-12-04T09:17:11.9086418Z * [new tag] ciflow/trunk/169091 -> ciflow/trunk/169091 2025-12-04T09:17:11.9087523Z * [new tag] ciflow/trunk/169102 -> ciflow/trunk/169102 2025-12-04T09:17:11.9089123Z * [new tag] ciflow/trunk/169103 -> ciflow/trunk/169103 2025-12-04T09:17:11.9090600Z * [new tag] ciflow/trunk/169125 -> ciflow/trunk/169125 2025-12-04T09:17:11.9092794Z * [new tag] ciflow/trunk/169139 -> ciflow/trunk/169139 2025-12-04T09:17:11.9094276Z * [new tag] ciflow/trunk/169148 -> ciflow/trunk/169148 2025-12-04T09:17:11.9095674Z * [new tag] ciflow/trunk/169151 -> 
ciflow/trunk/169151 2025-12-04T09:17:11.9096854Z * [new tag] ciflow/trunk/169156 -> ciflow/trunk/169156 2025-12-04T09:17:11.9098640Z * [new tag] ciflow/trunk/169176 -> ciflow/trunk/169176 2025-12-04T09:17:11.9099723Z * [new tag] ciflow/trunk/169204 -> ciflow/trunk/169204 2025-12-04T09:17:11.9101525Z * [new tag] ciflow/trunk/169207 -> ciflow/trunk/169207 2025-12-04T09:17:11.9102595Z * [new tag] ciflow/trunk/169211 -> ciflow/trunk/169211 2025-12-04T09:17:11.9104374Z * [new tag] ciflow/trunk/169231 -> ciflow/trunk/169231 2025-12-04T09:17:11.9105679Z * [new tag] ciflow/trunk/169260 -> ciflow/trunk/169260 2025-12-04T09:17:11.9107358Z * [new tag] ciflow/trunk/169271 -> ciflow/trunk/169271 2025-12-04T09:17:11.9108445Z * [new tag] ciflow/trunk/169280 -> ciflow/trunk/169280 2025-12-04T09:17:11.9110013Z * [new tag] ciflow/trunk/169281 -> ciflow/trunk/169281 2025-12-04T09:17:11.9111171Z * [new tag] ciflow/trunk/169286 -> ciflow/trunk/169286 2025-12-04T09:17:11.9112888Z * [new tag] ciflow/trunk/169293 -> ciflow/trunk/169293 2025-12-04T09:17:11.9114028Z * [new tag] ciflow/trunk/169296 -> ciflow/trunk/169296 2025-12-04T09:17:11.9115558Z * [new tag] ciflow/trunk/169304 -> ciflow/trunk/169304 2025-12-04T09:17:11.9116705Z * [new tag] ciflow/trunk/169305 -> ciflow/trunk/169305 2025-12-04T09:17:11.9118318Z * [new tag] ciflow/trunk/169312 -> ciflow/trunk/169312 2025-12-04T09:17:11.9120051Z * [new tag] ciflow/trunk/169328 -> ciflow/trunk/169328 2025-12-04T09:17:11.9121176Z * [new tag] ciflow/trunk/169343 -> ciflow/trunk/169343 2025-12-04T09:17:11.9122684Z * [new tag] ciflow/trunk/169355 -> ciflow/trunk/169355 2025-12-04T09:17:11.9124077Z * [new tag] ciflow/trunk/169370 -> ciflow/trunk/169370 2025-12-04T09:17:11.9125548Z * [new tag] ciflow/trunk/169379 -> ciflow/trunk/169379 2025-12-04T09:17:11.9126683Z * [new tag] ciflow/trunk/169380 -> ciflow/trunk/169380 2025-12-04T09:17:11.9128249Z * [new tag] ciflow/trunk/169385 -> ciflow/trunk/169385 2025-12-04T09:17:11.9129458Z * [new tag] 
ciflow/trunk/169387 -> ciflow/trunk/169387 2025-12-04T09:17:11.9131123Z * [new tag] ciflow/trunk/169410 -> ciflow/trunk/169410 2025-12-04T09:17:11.9132569Z * [new tag] ciflow/trunk/169412 -> ciflow/trunk/169412 2025-12-04T09:17:11.9133724Z * [new tag] ciflow/trunk/169418 -> ciflow/trunk/169418 2025-12-04T09:17:11.9135268Z * [new tag] ciflow/trunk/169423 -> ciflow/trunk/169423 2025-12-04T09:17:11.9136423Z * [new tag] ciflow/trunk/169427 -> ciflow/trunk/169427 2025-12-04T09:17:11.9137936Z * [new tag] ciflow/trunk/169430 -> ciflow/trunk/169430 2025-12-04T09:17:11.9139110Z * [new tag] ciflow/trunk/169437 -> ciflow/trunk/169437 2025-12-04T09:17:11.9140663Z * [new tag] ciflow/trunk/169442 -> ciflow/trunk/169442 2025-12-04T09:17:11.9141804Z * [new tag] ciflow/trunk/169452 -> ciflow/trunk/169452 2025-12-04T09:17:11.9143361Z * [new tag] ciflow/trunk/169454 -> ciflow/trunk/169454 2025-12-04T09:17:11.9144512Z * [new tag] ciflow/trunk/169459 -> ciflow/trunk/169459 2025-12-04T09:17:11.9146173Z * [new tag] ciflow/trunk/169474 -> ciflow/trunk/169474 2025-12-04T09:17:11.9147349Z * [new tag] ciflow/trunk/169475 -> ciflow/trunk/169475 2025-12-04T09:17:11.9148929Z * [new tag] ciflow/trunk/169476 -> ciflow/trunk/169476 2025-12-04T09:17:11.9150450Z * [new tag] ciflow/trunk/169487 -> ciflow/trunk/169487 2025-12-04T09:17:11.9152155Z * [new tag] ciflow/trunk/169497 -> ciflow/trunk/169497 2025-12-04T09:17:11.9153328Z * [new tag] ciflow/trunk/169503 -> ciflow/trunk/169503 2025-12-04T09:17:11.9154834Z * [new tag] ciflow/trunk/169505 -> ciflow/trunk/169505 2025-12-04T09:17:11.9156002Z * [new tag] ciflow/trunk/169507 -> ciflow/trunk/169507 2025-12-04T09:17:11.9157526Z * [new tag] ciflow/trunk/169514 -> ciflow/trunk/169514 2025-12-04T09:17:11.9158910Z * [new tag] ciflow/trunk/169517 -> ciflow/trunk/169517 2025-12-04T09:17:11.9160412Z * [new tag] ciflow/trunk/169519 -> ciflow/trunk/169519 2025-12-04T09:17:11.9161582Z * [new tag] ciflow/trunk/169528 -> ciflow/trunk/169528 
2025-12-04T09:17:11.9163140Z * [new tag] ciflow/trunk/169541 -> ciflow/trunk/169541 2025-12-04T09:17:11.9164584Z * [new tag] ciflow/trunk/169555 -> ciflow/trunk/169555 2025-12-04T09:17:11.9166492Z * [new tag] ciflow/unstable/123 -> ciflow/unstable/123 2025-12-04T09:17:11.9168041Z * [new tag] ciflow/vllm/165270 -> ciflow/vllm/165270 2025-12-04T09:17:11.9169123Z * [new tag] ciflow/vllm/165274 -> ciflow/vllm/165274 2025-12-04T09:17:11.9170514Z * [new tag] ciflow/vllm/166494 -> ciflow/vllm/166494 2025-12-04T09:17:11.9171607Z * [new tag] ciflow/vllm/169219 -> ciflow/vllm/169219 2025-12-04T09:17:11.9173045Z * [new tag] ciflow/vllm/169220 -> ciflow/vllm/169220 2025-12-04T09:17:11.9174725Z * [new tag] ciflow/xpu/157994 -> ciflow/xpu/157994 2025-12-04T09:17:11.9175733Z * [new tag] ciflow/xpu/159718 -> ciflow/xpu/159718 2025-12-04T09:17:11.9177160Z * [new tag] ciflow/xpu/161940 -> ciflow/xpu/161940 2025-12-04T09:17:11.9178639Z * [new tag] ciflow/xpu/163251 -> ciflow/xpu/163251 2025-12-04T09:17:11.9179651Z * [new tag] ciflow/xpu/166829 -> ciflow/xpu/166829 2025-12-04T09:17:11.9181060Z * [new tag] ciflow/xpu/166843 -> ciflow/xpu/166843 2025-12-04T09:17:11.9182124Z * [new tag] ciflow/xpu/167972 -> ciflow/xpu/167972 2025-12-04T09:17:11.9183552Z * [new tag] ciflow/xpu/167981 -> ciflow/xpu/167981 2025-12-04T09:17:11.9184598Z * [new tag] ciflow/xpu/168213 -> ciflow/xpu/168213 2025-12-04T09:17:11.9186009Z * [new tag] ciflow/xpu/168262 -> ciflow/xpu/168262 2025-12-04T09:17:11.9187054Z * [new tag] ciflow/xpu/168328 -> ciflow/xpu/168328 2025-12-04T09:17:11.9188905Z * [new tag] ciflow/xpu/168950 -> ciflow/xpu/168950 2025-12-04T09:17:11.9190718Z * [new tag] ciflow/xpu/169039 -> ciflow/xpu/169039 2025-12-04T09:17:11.9192266Z * [new tag] ciflow/xpu/169200 -> ciflow/xpu/169200 2025-12-04T09:17:11.9193416Z * [new tag] ciflow/xpu/169203 -> ciflow/xpu/169203 2025-12-04T09:17:11.9194917Z * [new tag] ciflow/xpu/169230 -> ciflow/xpu/169230 2025-12-04T09:17:11.9196041Z * [new tag] 
ciflow/xpu/169231 -> ciflow/xpu/169231 2025-12-04T09:17:11.9197684Z * [new tag] ciflow/xpu/169241 -> ciflow/xpu/169241 2025-12-04T09:17:11.9198879Z * [new tag] ciflow/xpu/169280 -> ciflow/xpu/169280 2025-12-04T09:17:11.9200714Z * [new tag] ciflow/xpu/169296 -> ciflow/xpu/169296 2025-12-04T09:17:11.9202304Z * [new tag] ciflow/xpu/169353 -> ciflow/xpu/169353 2025-12-04T09:17:11.9204093Z * [new tag] ciflow/xpu/169410 -> ciflow/xpu/169410 2025-12-04T09:17:11.9205248Z * [new tag] ciflow/xpu/169442 -> ciflow/xpu/169442 2025-12-04T09:17:11.9206818Z * [new tag] ciflow/xpu/169555 -> ciflow/xpu/169555 2025-12-04T09:17:11.9208241Z * [new tag] cslpull75 -> cslpull75 2025-12-04T09:17:11.9209407Z * [new tag] cslpull76 -> cslpull76 2025-12-04T09:17:11.9210812Z * [new tag] cslpull77 -> cslpull77 2025-12-04T09:17:11.9212329Z * [new tag] cslpull78 -> cslpull78 2025-12-04T09:17:11.9213823Z * [new tag] cslpull79 -> cslpull79 2025-12-04T09:17:11.9215500Z * [new tag] cslpull80 -> cslpull80 2025-12-04T09:17:11.9216909Z * [new tag] cslpull81 -> cslpull81 2025-12-04T09:17:11.9218490Z * [new tag] cslpull82 -> cslpull82 2025-12-04T09:17:11.9219882Z * [new tag] cslpull83 -> cslpull83 2025-12-04T09:17:11.9221216Z * [new tag] cslpull84 -> cslpull84 2025-12-04T09:17:11.9222553Z * [new tag] cslpull85 -> cslpull85 2025-12-04T09:17:11.9223951Z * [new tag] cslpull86 -> cslpull86 2025-12-04T09:17:11.9225376Z * [new tag] cslpull87 -> cslpull87 2025-12-04T09:17:11.9226818Z * [new tag] cslpull88 -> cslpull88 2025-12-04T09:17:11.9228212Z * [new tag] cslpull89 -> cslpull89 2025-12-04T09:17:11.9229253Z * [new tag] cslpull90 -> cslpull90 2025-12-04T09:17:11.9231193Z * [new tag] cslpull91 -> cslpull91 2025-12-04T09:17:11.9232529Z * [new tag] cslpull92 -> cslpull92 2025-12-04T09:17:11.9234035Z * [new tag] flight_5 -> flight_5 2025-12-04T09:17:11.9235606Z * [new tag] flight_5.1 -> flight_5.1 2025-12-04T09:17:11.9237042Z * [new tag] flight_5.2 -> flight_5.2 2025-12-04T09:17:11.9238506Z * [new tag] flight_5.3 -> 
flight_5.3 2025-12-04T09:17:11.9240038Z * [new tag] forpull1 -> forpull1 2025-12-04T09:17:11.9241819Z * [new tag] malfet/tag-2ef5611 -> malfet/tag-2ef5611 2025-12-04T09:17:11.9243358Z * [new tag] malfet/tag-317b1a0 -> malfet/tag-317b1a0 2025-12-04T09:17:11.9244534Z * [new tag] malfet/tag-ec6f767 -> malfet/tag-ec6f767 2025-12-04T09:17:11.9246302Z * [new tag] nightly-binary -> nightly-binary 2025-12-04T09:17:11.9247461Z * [new tag] sqzhang_flight4_plus -> sqzhang_flight4_plus 2025-12-04T09:17:11.9249237Z * [new tag] sqzhang_flight_3 -> sqzhang_flight_3 2025-12-04T09:17:11.9251076Z * [new tag] trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 -> trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 2025-12-04T09:17:11.9252317Z * [new tag] trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e -> trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e 2025-12-04T09:17:11.9254238Z * [new tag] trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 -> trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 2025-12-04T09:17:11.9255732Z * [new tag] trunk/07dcc0b83db3211653a38565a24e15acdba75654 -> trunk/07dcc0b83db3211653a38565a24e15acdba75654 2025-12-04T09:17:11.9257406Z * [new tag] trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb -> trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb 2025-12-04T09:17:11.9258628Z * [new tag] trunk/088048f2fea28ff7d450f65c72419ca45780d30b -> trunk/088048f2fea28ff7d450f65c72419ca45780d30b 2025-12-04T09:17:11.9260095Z * [new tag] trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 -> trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 2025-12-04T09:17:11.9261507Z * [new tag] trunk/0b80a4c62b94402844bf221791c096b0035c6d75 -> trunk/0b80a4c62b94402844bf221791c096b0035c6d75 2025-12-04T09:17:11.9263407Z * [new tag] trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 -> trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 2025-12-04T09:17:11.9264773Z * [new tag] trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 -> trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 2025-12-04T09:17:11.9266158Z * [new tag] 
trunk/135f3753c418a6879b1954904184937b67e61688 -> trunk/135f3753c418a6879b1954904184937b67e61688 2025-12-04T09:17:11.9267632Z * [new tag] trunk/15da21026cb13cd20257dc9e96830db108743c10 -> trunk/15da21026cb13cd20257dc9e96830db108743c10 2025-12-04T09:17:11.9269141Z * [new tag] trunk/166efdad2ac827f30fb02504c6017520257f88ec -> trunk/166efdad2ac827f30fb02504c6017520257f88ec 2025-12-04T09:17:11.9270556Z * [new tag] trunk/174272c15fae553d8488140af931f7d8050a313f -> trunk/174272c15fae553d8488140af931f7d8050a313f 2025-12-04T09:17:11.9272436Z * [new tag] trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 -> trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 2025-12-04T09:17:11.9273704Z * [new tag] trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 -> trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 2025-12-04T09:17:11.9275105Z * [new tag] trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 -> trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 2025-12-04T09:17:11.9276575Z * [new tag] trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 -> trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 2025-12-04T09:17:11.9278041Z * [new tag] trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e -> trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e 2025-12-04T09:17:11.9279529Z * [new tag] trunk/1c87554d74140eaee964ca8b1832cede67f5f520 -> trunk/1c87554d74140eaee964ca8b1832cede67f5f520 2025-12-04T09:17:11.9281388Z * [new tag] trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 -> trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 2025-12-04T09:17:11.9282676Z * [new tag] trunk/1cee47d6ce0a02227185b566593f002dd639ca0c -> trunk/1cee47d6ce0a02227185b566593f002dd639ca0c 2025-12-04T09:17:11.9283999Z * [new tag] trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d -> trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d 2025-12-04T09:17:11.9285468Z * [new tag] trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 -> trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 2025-12-04T09:17:11.9287258Z * [new tag] trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de -> 
trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de 2025-12-04T09:17:11.9288598Z * [new tag] trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 -> trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 2025-12-04T09:17:11.9289958Z * [new tag] trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 -> trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 2025-12-04T09:17:11.9291409Z * [new tag] trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f -> trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f 2025-12-04T09:17:11.9293232Z * [new tag] trunk/285779b1621cf9f073a062b0889a642d200308d9 -> trunk/285779b1621cf9f073a062b0889a642d200308d9 2025-12-04T09:17:11.9294433Z * [new tag] trunk/2887faaec6295d081580d09fce161201826c6d87 -> trunk/2887faaec6295d081580d09fce161201826c6d87 2025-12-04T09:17:11.9295864Z * [new tag] trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc -> trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc 2025-12-04T09:17:11.9297306Z * [new tag] trunk/29856679769b3dede478767e2fe6cfb51197cb25 -> trunk/29856679769b3dede478767e2fe6cfb51197cb25 2025-12-04T09:17:11.9298868Z * [new tag] trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 -> trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 2025-12-04T09:17:11.9300501Z * [new tag] trunk/2ac3ef882afb23136adc188975f0a8802fc68adf -> trunk/2ac3ef882afb23136adc188975f0a8802fc68adf 2025-12-04T09:17:11.9301843Z * [new tag] trunk/2bec68e73b64715354af076ad309335f943e36cd -> trunk/2bec68e73b64715354af076ad309335f943e36cd 2025-12-04T09:17:11.9303257Z * [new tag] trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 -> trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 2025-12-04T09:17:11.9304902Z * [new tag] trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 -> trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 2025-12-04T09:17:11.9306316Z * [new tag] trunk/2df6058f116a65722a0e03073402feb242572d35 -> trunk/2df6058f116a65722a0e03073402feb242572d35 2025-12-04T09:17:11.9307836Z * [new tag] trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec -> trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec 
2025-12-04T09:17:11.9309694Z * [new tag] trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 -> trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 2025-12-04T09:17:11.9311091Z * [new tag] trunk/305168768a95d69c444df5cd334bb774edfe06f1 -> trunk/305168768a95d69c444df5cd334bb774edfe06f1 2025-12-04T09:17:11.9312486Z * [new tag] trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 -> trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 2025-12-04T09:17:11.9313943Z * [new tag] trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 -> trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 2025-12-04T09:17:11.9315436Z * [new tag] trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 -> trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 2025-12-04T09:17:11.9316941Z * [new tag] trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf -> trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf 2025-12-04T09:17:11.9318413Z * [new tag] trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee -> trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee 2025-12-04T09:17:11.9320012Z * [new tag] trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 -> trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 2025-12-04T09:17:11.9321311Z * [new tag] trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 -> trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 2025-12-04T09:17:11.9322841Z * [new tag] trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae -> trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae 2025-12-04T09:17:11.9324280Z * [new tag] trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f -> trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f 2025-12-04T09:17:11.9325713Z * [new tag] trunk/42e9005cda22da3f1c559c3649218cebd671027c -> trunk/42e9005cda22da3f1c559c3649218cebd671027c 2025-12-04T09:17:11.9327172Z * [new tag] trunk/43b94713bbf340d3c124fde02d0f73add4021247 -> trunk/43b94713bbf340d3c124fde02d0f73add4021247 2025-12-04T09:17:11.9328620Z * [new tag] trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c -> trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c 2025-12-04T09:17:11.9330041Z * [new tag] 
trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a -> trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a 2025-12-04T09:17:11.9331427Z * [new tag] trunk/45d310ad84854dff730c0b12e577d7998d978686 -> trunk/45d310ad84854dff730c0b12e577d7998d978686 2025-12-04T09:17:11.9333422Z * [new tag] trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 -> trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 2025-12-04T09:17:11.9334471Z * [new tag] trunk/481e5ab336275bd3acd5fa8a611b05b4469012af -> trunk/481e5ab336275bd3acd5fa8a611b05b4469012af 2025-12-04T09:17:11.9335966Z * [new tag] trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 -> trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 2025-12-04T09:17:11.9338228Z * [new tag] trunk/49a04d26088acc17d948ddd66920f3e16371e873 -> trunk/49a04d26088acc17d948ddd66920f3e16371e873 2025-12-04T09:17:11.9339524Z * [new tag] trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 -> trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 2025-12-04T09:17:11.9340817Z * [new tag] trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f -> trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f 2025-12-04T09:17:11.9342628Z * [new tag] trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa -> trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa 2025-12-04T09:17:11.9344005Z * [new tag] trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c -> trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c 2025-12-04T09:17:11.9346045Z * [new tag] trunk/4fefb8e7e942386ffac764a41b232241f82bea3a -> trunk/4fefb8e7e942386ffac764a41b232241f82bea3a 2025-12-04T09:17:11.9347363Z * [new tag] trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d -> trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d 2025-12-04T09:17:11.9348882Z * [new tag] trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 -> trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 2025-12-04T09:17:11.9350326Z * [new tag] trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 -> trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 2025-12-04T09:17:11.9351869Z * [new tag] trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a -> 
trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a 2025-12-04T09:17:11.9353353Z * [new tag] trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 -> trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 2025-12-04T09:17:11.9354832Z * [new tag] trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 -> trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 2025-12-04T09:17:11.9356621Z * [new tag] trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 -> trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 2025-12-04T09:17:11.9357858Z * [new tag] trunk/5634469fda9e5d98869c82c7d03bb08914245f96 -> trunk/5634469fda9e5d98869c82c7d03bb08914245f96 2025-12-04T09:17:11.9359320Z * [new tag] trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc -> trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc 2025-12-04T09:17:11.9361154Z * [new tag] trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 -> trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 2025-12-04T09:17:11.9362438Z * [new tag] trunk/597930f6b568852356ca9795dac76f9e4653adbd -> trunk/597930f6b568852356ca9795dac76f9e4653adbd 2025-12-04T09:17:11.9363741Z * [new tag] trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 -> trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 2025-12-04T09:17:11.9365617Z * [new tag] trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 -> trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 2025-12-04T09:17:11.9366900Z * [new tag] trunk/5a607febc04c3a2b5824c75f3f60307867439a2c -> trunk/5a607febc04c3a2b5824c75f3f60307867439a2c 2025-12-04T09:17:11.9368456Z * [new tag] trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b -> trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b 2025-12-04T09:17:11.9369736Z * [new tag] trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c -> trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c 2025-12-04T09:17:11.9371208Z * [new tag] trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 -> trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 2025-12-04T09:17:11.9372849Z * [new tag] trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 -> trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 
2025-12-04T09:17:11.9374309Z * [new tag] trunk/61be54a31dc09b59d99b62176fb935aee0b924ef -> trunk/61be54a31dc09b59d99b62176fb935aee0b924ef 2025-12-04T09:17:11.9375773Z * [new tag] trunk/62d3ccd71484ed6a760d909b41487101bbc65719 -> trunk/62d3ccd71484ed6a760d909b41487101bbc65719 2025-12-04T09:17:11.9377314Z * [new tag] trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b -> trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b 2025-12-04T09:17:11.9378789Z * [new tag] trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a -> trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a 2025-12-04T09:17:11.9380294Z * [new tag] trunk/66004b993744b4106bf8afaba71f3c228a804206 -> trunk/66004b993744b4106bf8afaba71f3c228a804206 2025-12-04T09:17:11.9381750Z * [new tag] trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 -> trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 2025-12-04T09:17:11.9383219Z * [new tag] trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 -> trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 2025-12-04T09:17:11.9384812Z * [new tag] trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d -> trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d 2025-12-04T09:17:11.9386176Z * [new tag] trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b -> trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b 2025-12-04T09:17:11.9387622Z * [new tag] trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 -> trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 2025-12-04T09:17:11.9389170Z * [new tag] trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 -> trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 2025-12-04T09:17:11.9390711Z * [new tag] trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec -> trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec 2025-12-04T09:17:11.9392222Z * [new tag] trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 -> trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 2025-12-04T09:17:11.9393694Z * [new tag] trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d -> trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d 2025-12-04T09:17:11.9395177Z * [new tag] 
trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a -> trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a 2025-12-04T09:17:11.9396698Z * [new tag] trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e -> trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e 2025-12-04T09:17:11.9398242Z * [new tag] trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 -> trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 2025-12-04T09:17:11.9399776Z * [new tag] trunk/70d797a5fc109b20a517646fcaa819477cd0d485 -> trunk/70d797a5fc109b20a517646fcaa819477cd0d485 2025-12-04T09:17:11.9403092Z * [new tag] trunk/7348cb355ff0a6f79cd4871215aea72185748734 -> trunk/7348cb355ff0a6f79cd4871215aea72185748734 2025-12-04T09:17:11.9404424Z * [new tag] trunk/74fe26a1ebe32931783569f2e762e3c2c974901f -> trunk/74fe26a1ebe32931783569f2e762e3c2c974901f 2025-12-04T09:17:11.9405919Z * [new tag] trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 -> trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 2025-12-04T09:17:11.9407259Z * [new tag] trunk/7716da9fb23f27a65b41f9f016a2afadf281c18f -> trunk/7716da9fb23f27a65b41f9f016a2afadf281c18f 2025-12-04T09:17:11.9408965Z * [new tag] trunk/7741edd4ed665f3988052e260863efb508d61a03 -> trunk/7741edd4ed665f3988052e260863efb508d61a03 2025-12-04T09:17:11.9410473Z * [new tag] trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 -> trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 2025-12-04T09:17:11.9411966Z * [new tag] trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 -> trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 2025-12-04T09:17:11.9413293Z * [new tag] trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 -> trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 2025-12-04T09:17:11.9414811Z * [new tag] trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca -> trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca 2025-12-04T09:17:11.9416245Z * [new tag] trunk/7b7af390ea8541c611d1ce2018a6934188fc197b -> trunk/7b7af390ea8541c611d1ce2018a6934188fc197b 2025-12-04T09:17:11.9417739Z * [new tag] trunk/7ba4680f3755a560af81aa0f688791e367aa3609 -> 
trunk/7ba4680f3755a560af81aa0f688791e367aa3609 2025-12-04T09:17:11.9419264Z * [new tag] trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b -> trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b 2025-12-04T09:17:11.9420566Z * [new tag] trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T09:17:11.9423250Z * [new tag] trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 -> trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 2025-12-04T09:17:11.9424082Z * [new tag] trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed -> trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed 2025-12-04T09:17:11.9425327Z * [new tag] trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 -> trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 2025-12-04T09:17:11.9426306Z * [new tag] trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e -> trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e 2025-12-04T09:17:11.9427596Z * [new tag] trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead -> trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead 2025-12-04T09:17:11.9429104Z * [new tag] trunk/81af382128efa094d8702e18f2c133760904c718 -> trunk/81af382128efa094d8702e18f2c133760904c718 2025-12-04T09:17:11.9431052Z * [new tag] trunk/84149583d483e9c973c9a0feda70e4f3964947b0 -> trunk/84149583d483e9c973c9a0feda70e4f3964947b0 2025-12-04T09:17:11.9432587Z * [new tag] trunk/85a315917efe82c24306be805c584ec044951c75 -> trunk/85a315917efe82c24306be805c584ec044951c75 2025-12-04T09:17:11.9434008Z * [new tag] trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece -> trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece 2025-12-04T09:17:11.9436034Z * [new tag] trunk/892640e25aeefa8007c5af837214b4502b6b62a6 -> trunk/892640e25aeefa8007c5af837214b4502b6b62a6 2025-12-04T09:17:11.9437539Z * [new tag] trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 -> trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 2025-12-04T09:17:11.9439031Z * [new tag] trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c -> trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c 
2025-12-04T09:17:11.9440895Z * [new tag] trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 -> trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 2025-12-04T09:17:11.9442266Z * [new tag] trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 -> trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 2025-12-04T09:17:11.9443798Z * [new tag] trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca -> trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca 2025-12-04T09:17:11.9445309Z * [new tag] trunk/90b27e7e8352cde97d32ddad24740ef819633f38 -> trunk/90b27e7e8352cde97d32ddad24740ef819633f38 2025-12-04T09:17:11.9446662Z * [new tag] trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 -> trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 2025-12-04T09:17:11.9448013Z * [new tag] trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c -> trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c 2025-12-04T09:17:11.9449610Z * [new tag] trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 -> trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 2025-12-04T09:17:11.9451013Z * [new tag] trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 -> trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 2025-12-04T09:17:11.9452792Z * [new tag] trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa -> trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa 2025-12-04T09:17:11.9454307Z * [new tag] trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d -> trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d 2025-12-04T09:17:11.9455831Z * [new tag] trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 -> trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 2025-12-04T09:17:11.9457412Z * [new tag] trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 -> trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 2025-12-04T09:17:11.9458944Z * [new tag] trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d -> trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d 2025-12-04T09:17:11.9460400Z * [new tag] trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a -> trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a 2025-12-04T09:17:11.9461906Z * [new tag] 
trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 -> trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 2025-12-04T09:17:11.9463463Z * [new tag] trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 -> trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 2025-12-04T09:17:11.9464936Z * [new tag] trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa -> trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa 2025-12-04T09:17:11.9466678Z * [new tag] trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d -> trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d 2025-12-04T09:17:11.9468157Z * [new tag] trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c -> trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c 2025-12-04T09:17:11.9469675Z * [new tag] trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 -> trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 2025-12-04T09:17:11.9471196Z * [new tag] trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c -> trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c 2025-12-04T09:17:11.9472570Z * [new tag] trunk/a7dc6dab9ad911259d4801c502907e531594db45 -> trunk/a7dc6dab9ad911259d4801c502907e531594db45 2025-12-04T09:17:11.9474214Z * [new tag] trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 -> trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 2025-12-04T09:17:11.9475697Z * [new tag] trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e -> trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e 2025-12-04T09:17:11.9477224Z * [new tag] trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e -> trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e 2025-12-04T09:17:11.9478720Z * [new tag] trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e -> trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e 2025-12-04T09:17:11.9480476Z * [new tag] trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 -> trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 2025-12-04T09:17:11.9481508Z * [new tag] trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 -> trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 2025-12-04T09:17:11.9483422Z * [new tag] trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 -> 
trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 2025-12-04T09:17:11.9484925Z * [new tag] trunk/b39813b4a04931682b0491adba2138d01d716d99 -> trunk/b39813b4a04931682b0491adba2138d01d716d99 2025-12-04T09:17:11.9486506Z * [new tag] trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 -> trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 2025-12-04T09:17:11.9488065Z * [new tag] trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 -> trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 2025-12-04T09:17:11.9489553Z * [new tag] trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a -> trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a 2025-12-04T09:17:11.9491058Z * [new tag] trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 -> trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 2025-12-04T09:17:11.9492645Z * [new tag] trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 -> trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 2025-12-04T09:17:11.9494178Z * [new tag] trunk/b7d60685f8cbc939b68a20871e90db67e729329b -> trunk/b7d60685f8cbc939b68a20871e90db67e729329b 2025-12-04T09:17:11.9495722Z * [new tag] trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e -> trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e 2025-12-04T09:17:11.9497363Z * [new tag] trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf -> trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf 2025-12-04T09:17:11.9498792Z * [new tag] trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 -> trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 2025-12-04T09:17:11.9500594Z * [new tag] trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f -> trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f 2025-12-04T09:17:11.9502373Z * [new tag] trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f -> trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f 2025-12-04T09:17:11.9503837Z * [new tag] trunk/bb3034198b459401fabeab254e1b99f0115046e2 -> trunk/bb3034198b459401fabeab254e1b99f0115046e2 2025-12-04T09:17:11.9505355Z * [new tag] trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 -> trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 
2025-12-04T09:17:11.9507250Z * [new tag] trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 -> trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 2025-12-04T09:17:11.9508889Z * [new tag] trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 -> trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 2025-12-04T09:17:11.9510177Z * [new tag] trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 -> trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 2025-12-04T09:17:11.9512108Z * [new tag] trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 -> trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 2025-12-04T09:17:11.9513568Z * [new tag] trunk/c0660bcee27e7d7731634e274576a7081882bede -> trunk/c0660bcee27e7d7731634e274576a7081882bede 2025-12-04T09:17:11.9515212Z * [new tag] trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac -> trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac 2025-12-04T09:17:11.9516692Z * [new tag] trunk/c55b1e8f61d041ee436d697449eb028931d574fb -> trunk/c55b1e8f61d041ee436d697449eb028931d574fb 2025-12-04T09:17:11.9518097Z * [new tag] trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 -> trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 2025-12-04T09:17:11.9520063Z * [new tag] trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 -> trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 2025-12-04T09:17:11.9521575Z * [new tag] trunk/cc0853af42122f8185321f542616f4474e717f09 -> trunk/cc0853af42122f8185321f542616f4474e717f09 2025-12-04T09:17:11.9522868Z * [new tag] trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 -> trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 2025-12-04T09:17:11.9524560Z * [new tag] trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a -> trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a 2025-12-04T09:17:11.9526135Z * [new tag] trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace -> trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace 2025-12-04T09:17:11.9527602Z * [new tag] trunk/d16447dacaf2420ea175f0c275c75da951f57d39 -> trunk/d16447dacaf2420ea175f0c275c75da951f57d39 2025-12-04T09:17:11.9529154Z * [new tag] 
trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 -> trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 2025-12-04T09:17:11.9530662Z * [new tag] trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 -> trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 2025-12-04T09:17:11.9532249Z * [new tag] trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf -> trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf 2025-12-04T09:17:11.9533857Z * [new tag] trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 -> trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 2025-12-04T09:17:11.9535697Z * [new tag] trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d -> trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d 2025-12-04T09:17:11.9537285Z * [new tag] trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 -> trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 2025-12-04T09:17:11.9538993Z * [new tag] trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 -> trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 2025-12-04T09:17:11.9540098Z * [new tag] trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e -> trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e 2025-12-04T09:17:11.9541841Z * [new tag] trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a -> trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a 2025-12-04T09:17:11.9543346Z * [new tag] trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b -> trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b 2025-12-04T09:17:11.9544958Z * [new tag] trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec -> trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec 2025-12-04T09:17:11.9546504Z * [new tag] trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf -> trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf 2025-12-04T09:17:11.9548032Z * [new tag] trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd -> trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd 2025-12-04T09:17:11.9549517Z * [new tag] trunk/dd18a75336a4fbd7497955cc5665904724fce889 -> trunk/dd18a75336a4fbd7497955cc5665904724fce889 2025-12-04T09:17:11.9551108Z * [new tag] trunk/ded9bcd61a059bf723e6e84689552962b480ea77 -> 
trunk/ded9bcd61a059bf723e6e84689552962b480ea77 2025-12-04T09:17:11.9552962Z * [new tag] trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c -> trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c 2025-12-04T09:17:11.9554624Z * [new tag] trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b -> trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b 2025-12-04T09:17:11.9555897Z * [new tag] trunk/e3f24fd73ad74c6e7176687986436956c7c18235 -> trunk/e3f24fd73ad74c6e7176687986436956c7c18235 2025-12-04T09:17:11.9557704Z * [new tag] trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e -> trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e 2025-12-04T09:17:11.9559289Z * [new tag] trunk/ea7035f462a0d2830865ee86c832bd101e1427fc -> trunk/ea7035f462a0d2830865ee86c832bd101e1427fc 2025-12-04T09:17:11.9560986Z * [new tag] trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 -> trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 2025-12-04T09:17:11.9562565Z * [new tag] trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf -> trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf 2025-12-04T09:17:11.9564101Z * [new tag] trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e -> trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e 2025-12-04T09:17:11.9565612Z * [new tag] trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e -> trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e 2025-12-04T09:17:11.9567517Z * [new tag] trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 -> trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 2025-12-04T09:17:11.9569150Z * [new tag] trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 -> trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 2025-12-04T09:17:11.9570677Z * [new tag] trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 -> trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 2025-12-04T09:17:11.9572176Z * [new tag] trunk/f1076f5510920044912247b1abb8760cb820f598 -> trunk/f1076f5510920044912247b1abb8760cb820f598 2025-12-04T09:17:11.9573664Z * [new tag] trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 -> trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 
2025-12-04T09:17:11.9575203Z * [new tag] trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 -> trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 2025-12-04T09:17:11.9576716Z * [new tag] trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 -> trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 2025-12-04T09:17:11.9578139Z * [new tag] trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 -> trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 2025-12-04T09:17:11.9579699Z * [new tag] trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 -> trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 2025-12-04T09:17:11.9581280Z * [new tag] trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 -> trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 2025-12-04T09:17:11.9582665Z * [new tag] trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 -> trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 2025-12-04T09:17:11.9584214Z * [new tag] trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b -> trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b 2025-12-04T09:17:11.9585720Z * [new tag] trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 -> trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 2025-12-04T09:17:11.9587702Z * [new tag] trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 -> trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 2025-12-04T09:17:11.9589232Z * [new tag] trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 -> trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 2025-12-04T09:17:11.9590947Z * [new tag] trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 -> trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:17:11.9591827Z * [new tag] v0.1.1 -> v0.1.1 2025-12-04T09:17:11.9593765Z * [new tag] v0.1.10 -> v0.1.10 2025-12-04T09:17:11.9594643Z * [new tag] v0.1.11 -> v0.1.11 2025-12-04T09:17:11.9596315Z * [new tag] v0.1.12 -> v0.1.12 2025-12-04T09:17:11.9597713Z * [new tag] v0.1.2 -> v0.1.2 2025-12-04T09:17:11.9599084Z * [new tag] v0.1.3 -> v0.1.3 2025-12-04T09:17:11.9600558Z * [new tag] v0.1.4 -> v0.1.4 2025-12-04T09:17:11.9603436Z * [new tag] v0.1.5 -> v0.1.5 
2025-12-04T09:17:11.9611511Z * [new tag] v0.1.6 -> v0.1.6 2025-12-04T09:17:11.9612040Z * [new tag] v0.1.7 -> v0.1.7 2025-12-04T09:17:11.9612335Z * [new tag] v0.1.8 -> v0.1.8 2025-12-04T09:17:11.9612475Z * [new tag] v0.1.9 -> v0.1.9 2025-12-04T09:17:11.9612615Z * [new tag] v0.2.0 -> v0.2.0 2025-12-04T09:17:11.9612768Z * [new tag] v0.3.0 -> v0.3.0 2025-12-04T09:17:11.9613467Z * [new tag] v0.3.1 -> v0.3.1 2025-12-04T09:17:11.9615079Z * [new tag] v0.4.0 -> v0.4.0 2025-12-04T09:17:11.9616431Z * [new tag] v0.4.1 -> v0.4.1 2025-12-04T09:17:11.9617836Z * [new tag] v1.0.0 -> v1.0.0 2025-12-04T09:17:11.9619247Z * [new tag] v1.0.0a0 -> v1.0.0a0 2025-12-04T09:17:11.9620686Z * [new tag] v1.0.1 -> v1.0.1 2025-12-04T09:17:11.9622089Z * [new tag] v1.0rc0 -> v1.0rc0 2025-12-04T09:17:11.9623370Z * [new tag] v1.0rc1 -> v1.0rc1 2025-12-04T09:17:11.9624826Z * [new tag] v1.1.0 -> v1.1.0 2025-12-04T09:17:11.9626276Z * [new tag] v1.1.0a0 -> v1.1.0a0 2025-12-04T09:17:11.9627867Z * [new tag] v1.10.0 -> v1.10.0 2025-12-04T09:17:11.9629457Z * [new tag] v1.10.0-rc1 -> v1.10.0-rc1 2025-12-04T09:17:11.9630859Z * [new tag] v1.10.0-rc2 -> v1.10.0-rc2 2025-12-04T09:17:11.9632069Z * [new tag] v1.10.0-rc3 -> v1.10.0-rc3 2025-12-04T09:17:11.9633825Z * [new tag] v1.10.1 -> v1.10.1 2025-12-04T09:17:11.9634604Z * [new tag] v1.10.1-rc1 -> v1.10.1-rc1 2025-12-04T09:17:11.9636013Z * [new tag] v1.10.2 -> v1.10.2 2025-12-04T09:17:11.9637213Z * [new tag] v1.10.2-rc1 -> v1.10.2-rc1 2025-12-04T09:17:11.9638727Z * [new tag] v1.11.0 -> v1.11.0 2025-12-04T09:17:11.9640950Z * [new tag] v1.11.0-rc1 -> v1.11.0-rc1 2025-12-04T09:17:11.9642625Z * [new tag] v1.11.0-rc2 -> v1.11.0-rc2 2025-12-04T09:17:11.9644112Z * [new tag] v1.11.0-rc3 -> v1.11.0-rc3 2025-12-04T09:17:11.9645560Z * [new tag] v1.11.0-rc4 -> v1.11.0-rc4 2025-12-04T09:17:11.9647023Z * [new tag] v1.11.0-rc5 -> v1.11.0-rc5 2025-12-04T09:17:11.9648268Z * [new tag] v1.11.0-rc6 -> v1.11.0-rc6 2025-12-04T09:17:11.9649472Z * [new tag] v1.11.0-rc7 -> v1.11.0-rc7 
2025-12-04T09:17:11.9651194Z * [new tag] v1.12.0 -> v1.12.0 2025-12-04T09:17:11.9652412Z * [new tag] v1.12.0-rc1 -> v1.12.0-rc1 2025-12-04T09:17:11.9654023Z * [new tag] v1.12.0-rc2 -> v1.12.0-rc2 2025-12-04T09:17:11.9655454Z * [new tag] v1.12.0-rc3 -> v1.12.0-rc3 2025-12-04T09:17:11.9656974Z * [new tag] v1.12.0-rc4 -> v1.12.0-rc4 2025-12-04T09:17:11.9658374Z * [new tag] v1.12.0-rc5 -> v1.12.0-rc5 2025-12-04T09:17:11.9659969Z * [new tag] v1.12.0-rc6 -> v1.12.0-rc6 2025-12-04T09:17:11.9661167Z * [new tag] v1.12.0-rc7 -> v1.12.0-rc7 2025-12-04T09:17:11.9662410Z * [new tag] v1.12.0-rc8 -> v1.12.0-rc8 2025-12-04T09:17:11.9663600Z * [new tag] v1.12.1 -> v1.12.1 2025-12-04T09:17:11.9665130Z * [new tag] v1.12.1-rc1 -> v1.12.1-rc1 2025-12-04T09:17:11.9666574Z * [new tag] v1.12.1-rc2 -> v1.12.1-rc2 2025-12-04T09:17:11.9668117Z * [new tag] v1.12.1-rc3 -> v1.12.1-rc3 2025-12-04T09:17:11.9669598Z * [new tag] v1.12.1-rc4 -> v1.12.1-rc4 2025-12-04T09:17:11.9670806Z * [new tag] v1.12.1-rc5 -> v1.12.1-rc5 2025-12-04T09:17:11.9672267Z * [new tag] v1.13.0 -> v1.13.0 2025-12-04T09:17:11.9673669Z * [new tag] v1.13.0-rc1 -> v1.13.0-rc1 2025-12-04T09:17:11.9675059Z * [new tag] v1.13.0-rc2 -> v1.13.0-rc2 2025-12-04T09:17:11.9676560Z * [new tag] v1.13.0-rc3 -> v1.13.0-rc3 2025-12-04T09:17:11.9678131Z * [new tag] v1.13.0-rc4 -> v1.13.0-rc4 2025-12-04T09:17:11.9679365Z * [new tag] v1.13.0-rc5 -> v1.13.0-rc5 2025-12-04T09:17:11.9680760Z * [new tag] v1.13.0-rc6 -> v1.13.0-rc6 2025-12-04T09:17:11.9682248Z * [new tag] v1.13.1 -> v1.13.1 2025-12-04T09:17:11.9683484Z * [new tag] v1.13.1-rc1 -> v1.13.1-rc1 2025-12-04T09:17:11.9684896Z * [new tag] v1.2.0 -> v1.2.0 2025-12-04T09:17:11.9686371Z * [new tag] v1.2.0a0 -> v1.2.0a0 2025-12-04T09:17:11.9687733Z * [new tag] v1.3.0 -> v1.3.0 2025-12-04T09:17:11.9689190Z * [new tag] v1.3.0a0 -> v1.3.0a0 2025-12-04T09:17:11.9690664Z * [new tag] v1.3.1 -> v1.3.1 2025-12-04T09:17:11.9691822Z * [new tag] v1.4.0 -> v1.4.0 2025-12-04T09:17:11.9693262Z * [new tag] 
v1.4.0a0 -> v1.4.0a0 2025-12-04T09:17:11.9694469Z * [new tag] v1.4.1 -> v1.4.1 2025-12-04T09:17:11.9696013Z * [new tag] v1.5.0 -> v1.5.0 2025-12-04T09:17:11.9697617Z * [new tag] v1.5.0-rc1 -> v1.5.0-rc1 2025-12-04T09:17:11.9699217Z * [new tag] v1.5.0-rc2 -> v1.5.0-rc2 2025-12-04T09:17:11.9700731Z * [new tag] v1.5.0-rc3 -> v1.5.0-rc3 2025-12-04T09:17:11.9702354Z * [new tag] v1.5.0-rc4 -> v1.5.0-rc4 2025-12-04T09:17:11.9703564Z * [new tag] v1.5.0-rc5 -> v1.5.0-rc5 2025-12-04T09:17:11.9705068Z * [new tag] v1.5.1 -> v1.5.1 2025-12-04T09:17:11.9706280Z * [new tag] v1.5.1-rc1 -> v1.5.1-rc1 2025-12-04T09:17:11.9707791Z * [new tag] v1.6.0 -> v1.6.0 2025-12-04T09:17:11.9708953Z * [new tag] v1.6.0-rc1 -> v1.6.0-rc1 2025-12-04T09:17:11.9710607Z * [new tag] v1.6.0-rc2 -> v1.6.0-rc2 2025-12-04T09:17:11.9711999Z * [new tag] v1.6.0-rc3 -> v1.6.0-rc3 2025-12-04T09:17:11.9713393Z * [new tag] v1.6.0-rc4 -> v1.6.0-rc4 2025-12-04T09:17:11.9714847Z * [new tag] v1.6.0-rc5 -> v1.6.0-rc5 2025-12-04T09:17:11.9716234Z * [new tag] v1.6.0-rc6 -> v1.6.0-rc6 2025-12-04T09:17:11.9717762Z * [new tag] v1.6.0-rc7 -> v1.6.0-rc7 2025-12-04T09:17:11.9719114Z * [new tag] v1.7.0 -> v1.7.0 2025-12-04T09:17:11.9720672Z * [new tag] v1.7.0-rc1 -> v1.7.0-rc1 2025-12-04T09:17:11.9722226Z * [new tag] v1.7.0-rc2 -> v1.7.0-rc2 2025-12-04T09:17:11.9723720Z * [new tag] v1.7.0-rc3 -> v1.7.0-rc3 2025-12-04T09:17:11.9724927Z * [new tag] v1.7.0-rc4 -> v1.7.0-rc4 2025-12-04T09:17:11.9726358Z * [new tag] v1.7.1 -> v1.7.1 2025-12-04T09:17:11.9727945Z * [new tag] v1.7.1-rc1 -> v1.7.1-rc1 2025-12-04T09:17:11.9729552Z * [new tag] v1.7.1-rc2 -> v1.7.1-rc2 2025-12-04T09:17:11.9730830Z * [new tag] v1.7.1-rc3 -> v1.7.1-rc3 2025-12-04T09:17:11.9732706Z * [new tag] v1.8.0 -> v1.8.0 2025-12-04T09:17:11.9734133Z * [new tag] v1.8.0-rc1 -> v1.8.0-rc1 2025-12-04T09:17:11.9735530Z * [new tag] v1.8.0-rc2 -> v1.8.0-rc2 2025-12-04T09:17:11.9736977Z * [new tag] v1.8.0-rc3 -> v1.8.0-rc3 2025-12-04T09:17:11.9738401Z * [new tag] v1.8.0-rc4 -> 
v1.8.0-rc4 2025-12-04T09:17:11.9739583Z * [new tag] v1.8.0-rc5 -> v1.8.0-rc5 2025-12-04T09:17:11.9740793Z * [new tag] v1.8.1 -> v1.8.1 2025-12-04T09:17:11.9742285Z * [new tag] v1.8.1-rc1 -> v1.8.1-rc1 2025-12-04T09:17:11.9743525Z * [new tag] v1.8.1-rc2 -> v1.8.1-rc2 2025-12-04T09:17:11.9744887Z * [new tag] v1.8.1-rc3 -> v1.8.1-rc3 2025-12-04T09:17:11.9746824Z * [new tag] v1.8.2 -> v1.8.2 2025-12-04T09:17:11.9748193Z * [new tag] v1.8.2-rc1 -> v1.8.2-rc1 2025-12-04T09:17:11.9749638Z * [new tag] v1.9.0 -> v1.9.0 2025-12-04T09:17:11.9751086Z * [new tag] v1.9.0-rc1 -> v1.9.0-rc1 2025-12-04T09:17:11.9752647Z * [new tag] v1.9.0-rc2 -> v1.9.0-rc2 2025-12-04T09:17:11.9754093Z * [new tag] v1.9.0-rc3 -> v1.9.0-rc3 2025-12-04T09:17:11.9755370Z * [new tag] v1.9.0-rc4 -> v1.9.0-rc4 2025-12-04T09:17:11.9756824Z * [new tag] v1.9.1 -> v1.9.1 2025-12-04T09:17:11.9758524Z * [new tag] v1.9.1-rc1 -> v1.9.1-rc1 2025-12-04T09:17:11.9759904Z * [new tag] v1.9.1-rc2 -> v1.9.1-rc2 2025-12-04T09:17:11.9761323Z * [new tag] v2.0.0 -> v2.0.0 2025-12-04T09:17:11.9762719Z * [new tag] v2.0.0-rc1 -> v2.0.0-rc1 2025-12-04T09:17:11.9764176Z * [new tag] v2.0.0-rc2 -> v2.0.0-rc2 2025-12-04T09:17:11.9765672Z * [new tag] v2.0.0-rc3 -> v2.0.0-rc3 2025-12-04T09:17:11.9767073Z * [new tag] v2.0.0-rc4 -> v2.0.0-rc4 2025-12-04T09:17:11.9768583Z * [new tag] v2.0.0-rc5 -> v2.0.0-rc5 2025-12-04T09:17:11.9769962Z * [new tag] v2.0.0-rc6 -> v2.0.0-rc6 2025-12-04T09:17:11.9771421Z * [new tag] v2.0.1 -> v2.0.1 2025-12-04T09:17:11.9772902Z * [new tag] v2.0.1-rc1 -> v2.0.1-rc1 2025-12-04T09:17:11.9773914Z * [new tag] v2.0.1-rc2 -> v2.0.1-rc2 2025-12-04T09:17:11.9775495Z * [new tag] v2.0.1-rc3 -> v2.0.1-rc3 2025-12-04T09:17:11.9776738Z * [new tag] v2.0.1-rc4 -> v2.0.1-rc4 2025-12-04T09:17:11.9778761Z * [new tag] v2.1.0 -> v2.1.0 2025-12-04T09:17:11.9780212Z * [new tag] v2.1.0-rc1 -> v2.1.0-rc1 2025-12-04T09:17:11.9781724Z * [new tag] v2.1.0-rc2 -> v2.1.0-rc2 2025-12-04T09:17:11.9783221Z * [new tag] v2.1.0-rc3 -> 
v2.1.0-rc3 2025-12-04T09:17:11.9784725Z * [new tag] v2.1.0-rc4 -> v2.1.0-rc4 2025-12-04T09:17:11.9786235Z * [new tag] v2.1.0-rc5 -> v2.1.0-rc5 2025-12-04T09:17:11.9787475Z * [new tag] v2.1.0-rc6 -> v2.1.0-rc6 2025-12-04T09:17:11.9789090Z * [new tag] v2.1.1 -> v2.1.1 2025-12-04T09:17:11.9790647Z * [new tag] v2.1.1-rc1 -> v2.1.1-rc1 2025-12-04T09:17:11.9792188Z * [new tag] v2.1.1-rc2 -> v2.1.1-rc2 2025-12-04T09:17:11.9793805Z * [new tag] v2.1.1-rc3 -> v2.1.1-rc3 2025-12-04T09:17:11.9795338Z * [new tag] v2.1.1-rc4 -> v2.1.1-rc4 2025-12-04T09:17:11.9796732Z * [new tag] v2.1.1-rc5 -> v2.1.1-rc5 2025-12-04T09:17:11.9797947Z * [new tag] v2.1.1-rc6 -> v2.1.1-rc6 2025-12-04T09:17:11.9799397Z * [new tag] v2.1.2 -> v2.1.2 2025-12-04T09:17:11.9801148Z * [new tag] v2.1.2-rc1 -> v2.1.2-rc1 2025-12-04T09:17:11.9802773Z * [new tag] v2.1.2-rc2 -> v2.1.2-rc2 2025-12-04T09:17:11.9804019Z * [new tag] v2.1.2-rc3 -> v2.1.2-rc3 2025-12-04T09:17:11.9805514Z * [new tag] v2.2.0 -> v2.2.0 2025-12-04T09:17:11.9806993Z * [new tag] v2.2.0-rc1 -> v2.2.0-rc1 2025-12-04T09:17:11.9808453Z * [new tag] v2.2.0-rc2 -> v2.2.0-rc2 2025-12-04T09:17:11.9809882Z * [new tag] v2.2.0-rc3 -> v2.2.0-rc3 2025-12-04T09:17:11.9811313Z * [new tag] v2.2.0-rc4 -> v2.2.0-rc4 2025-12-04T09:17:11.9812753Z * [new tag] v2.2.0-rc5 -> v2.2.0-rc5 2025-12-04T09:17:11.9814177Z * [new tag] v2.2.0-rc6 -> v2.2.0-rc6 2025-12-04T09:17:11.9815392Z * [new tag] v2.2.0-rc7 -> v2.2.0-rc7 2025-12-04T09:17:11.9816627Z * [new tag] v2.2.0-rc8 -> v2.2.0-rc8 2025-12-04T09:17:11.9818181Z * [new tag] v2.2.1 -> v2.2.1 2025-12-04T09:17:11.9819692Z * [new tag] v2.2.1-rc1 -> v2.2.1-rc1 2025-12-04T09:17:11.9821069Z * [new tag] v2.2.1-rc2 -> v2.2.1-rc2 2025-12-04T09:17:11.9822281Z * [new tag] v2.2.1-rc3 -> v2.2.1-rc3 2025-12-04T09:17:11.9823450Z * [new tag] v2.2.2 -> v2.2.2 2025-12-04T09:17:11.9825479Z * [new tag] v2.2.2-rc1 -> v2.2.2-rc1 2025-12-04T09:17:11.9826726Z * [new tag] v2.2.2-rc2 -> v2.2.2-rc2 2025-12-04T09:17:11.9827954Z * [new tag] 
v2.2.2-rc3 -> v2.2.2-rc3 2025-12-04T09:17:11.9829671Z * [new tag] v2.3.0 -> v2.3.0 2025-12-04T09:17:11.9831031Z * [new tag] v2.3.0-rc1 -> v2.3.0-rc1 2025-12-04T09:17:11.9832529Z * [new tag] v2.3.0-rc10 -> v2.3.0-rc10 2025-12-04T09:17:11.9834066Z * [new tag] v2.3.0-rc11 -> v2.3.0-rc11 2025-12-04T09:17:11.9835300Z * [new tag] v2.3.0-rc12 -> v2.3.0-rc12 2025-12-04T09:17:11.9836756Z * [new tag] v2.3.0-rc2 -> v2.3.0-rc2 2025-12-04T09:17:11.9838423Z * [new tag] v2.3.0-rc3 -> v2.3.0-rc3 2025-12-04T09:17:11.9840018Z * [new tag] v2.3.0-rc4 -> v2.3.0-rc4 2025-12-04T09:17:11.9841485Z * [new tag] v2.3.0-rc5 -> v2.3.0-rc5 2025-12-04T09:17:11.9842758Z * [new tag] v2.3.0-rc6 -> v2.3.0-rc6 2025-12-04T09:17:11.9844197Z * [new tag] v2.3.0-rc7 -> v2.3.0-rc7 2025-12-04T09:17:11.9845700Z * [new tag] v2.3.0-rc8 -> v2.3.0-rc8 2025-12-04T09:17:11.9846947Z * [new tag] v2.3.0-rc9 -> v2.3.0-rc9 2025-12-04T09:17:11.9848170Z * [new tag] v2.3.1 -> v2.3.1 2025-12-04T09:17:11.9849729Z * [new tag] v2.3.1-rc1 -> v2.3.1-rc1 2025-12-04T09:17:11.9851176Z * [new tag] v2.3.1-rc2 -> v2.3.1-rc2 2025-12-04T09:17:11.9852732Z * [new tag] v2.3.1-rc3 -> v2.3.1-rc3 2025-12-04T09:17:11.9854189Z * [new tag] v2.4.0 -> v2.4.0 2025-12-04T09:17:11.9855611Z * [new tag] v2.4.0-rc1 -> v2.4.0-rc1 2025-12-04T09:17:11.9857071Z * [new tag] v2.4.0-rc2 -> v2.4.0-rc2 2025-12-04T09:17:11.9858511Z * [new tag] v2.4.0-rc3 -> v2.4.0-rc3 2025-12-04T09:17:11.9859922Z * [new tag] v2.4.0-rc4 -> v2.4.0-rc4 2025-12-04T09:17:11.9861444Z * [new tag] v2.4.0-rc5 -> v2.4.0-rc5 2025-12-04T09:17:11.9862897Z * [new tag] v2.4.0-rc6 -> v2.4.0-rc6 2025-12-04T09:17:11.9864431Z * [new tag] v2.4.0-rc7 -> v2.4.0-rc7 2025-12-04T09:17:11.9865821Z * [new tag] v2.4.0-rc8 -> v2.4.0-rc8 2025-12-04T09:17:11.9867331Z * [new tag] v2.4.0-rc9 -> v2.4.0-rc9 2025-12-04T09:17:11.9868624Z * [new tag] v2.4.1 -> v2.4.1 2025-12-04T09:17:11.9870173Z * [new tag] v2.4.1-rc1 -> v2.4.1-rc1 2025-12-04T09:17:11.9871602Z * [new tag] v2.4.1-rc2 -> v2.4.1-rc2 
2025-12-04T09:17:11.9873120Z * [new tag] v2.4.1-rc3 -> v2.4.1-rc3 2025-12-04T09:17:11.9874576Z * [new tag] v2.5.0 -> v2.5.0 2025-12-04T09:17:11.9876028Z * [new tag] v2.5.0-rc1 -> v2.5.0-rc1 2025-12-04T09:17:11.9877232Z * [new tag] v2.5.0-rc10 -> v2.5.0-rc10 2025-12-04T09:17:11.9878831Z * [new tag] v2.5.0-rc2 -> v2.5.0-rc2 2025-12-04T09:17:11.9880368Z * [new tag] v2.5.0-rc3 -> v2.5.0-rc3 2025-12-04T09:17:11.9881877Z * [new tag] v2.5.0-rc4 -> v2.5.0-rc4 2025-12-04T09:17:11.9883328Z * [new tag] v2.5.0-rc5 -> v2.5.0-rc5 2025-12-04T09:17:11.9884944Z * [new tag] v2.5.0-rc6 -> v2.5.0-rc6 2025-12-04T09:17:11.9886370Z * [new tag] v2.5.0-rc7 -> v2.5.0-rc7 2025-12-04T09:17:11.9887903Z * [new tag] v2.5.0-rc8 -> v2.5.0-rc8 2025-12-04T09:17:11.9889480Z * [new tag] v2.5.0-rc9 -> v2.5.0-rc9 2025-12-04T09:17:11.9890385Z * [new tag] v2.5.1 -> v2.5.1 2025-12-04T09:17:11.9891798Z * [new tag] v2.5.1-rc1 -> v2.5.1-rc1 2025-12-04T09:17:11.9893100Z * [new tag] v2.6.0 -> v2.6.0 2025-12-04T09:17:11.9894664Z * [new tag] v2.6.0-rc1 -> v2.6.0-rc1 2025-12-04T09:17:11.9896212Z * [new tag] v2.6.0-rc2 -> v2.6.0-rc2 2025-12-04T09:17:11.9897693Z * [new tag] v2.6.0-rc3 -> v2.6.0-rc3 2025-12-04T09:17:11.9899158Z * [new tag] v2.6.0-rc4 -> v2.6.0-rc4 2025-12-04T09:17:11.9900834Z * [new tag] v2.6.0-rc5 -> v2.6.0-rc5 2025-12-04T09:17:11.9902566Z * [new tag] v2.6.0-rc6 -> v2.6.0-rc6 2025-12-04T09:17:11.9903973Z * [new tag] v2.6.0-rc7 -> v2.6.0-rc7 2025-12-04T09:17:11.9905668Z * [new tag] v2.6.0-rc8 -> v2.6.0-rc8 2025-12-04T09:17:11.9907205Z * [new tag] v2.6.0-rc9 -> v2.6.0-rc9 2025-12-04T09:17:11.9908906Z * [new tag] v2.7.0 -> v2.7.0 2025-12-04T09:17:11.9910409Z * [new tag] v2.7.0-rc1 -> v2.7.0-rc1 2025-12-04T09:17:11.9911675Z * [new tag] v2.7.0-rc10 -> v2.7.0-rc10 2025-12-04T09:17:11.9913249Z * [new tag] v2.7.0-rc2 -> v2.7.0-rc2 2025-12-04T09:17:11.9914764Z * [new tag] v2.7.0-rc3 -> v2.7.0-rc3 2025-12-04T09:17:11.9916263Z * [new tag] v2.7.0-rc4 -> v2.7.0-rc4 2025-12-04T09:17:11.9917735Z * [new tag] 
v2.7.0-rc5 -> v2.7.0-rc5 2025-12-04T09:17:11.9919658Z * [new tag] v2.7.0-rc6 -> v2.7.0-rc6 2025-12-04T09:17:11.9921264Z * [new tag] v2.7.0-rc7 -> v2.7.0-rc7 2025-12-04T09:17:11.9922805Z * [new tag] v2.7.0-rc8 -> v2.7.0-rc8 2025-12-04T09:17:11.9924265Z * [new tag] v2.7.0-rc9 -> v2.7.0-rc9 2025-12-04T09:17:11.9925504Z * [new tag] v2.7.1 -> v2.7.1 2025-12-04T09:17:11.9927068Z * [new tag] v2.7.1-rc1 -> v2.7.1-rc1 2025-12-04T09:17:11.9928551Z * [new tag] v2.7.1-rc2 -> v2.7.1-rc2 2025-12-04T09:17:11.9930209Z * [new tag] v2.7.1-rc3 -> v2.7.1-rc3 2025-12-04T09:17:11.9931816Z * [new tag] v2.7.1-rc4 -> v2.7.1-rc4 2025-12-04T09:17:11.9933612Z * [new tag] v2.7.1-rc5 -> v2.7.1-rc5 2025-12-04T09:17:11.9934645Z * [new tag] v2.8.0 -> v2.8.0 2025-12-04T09:17:11.9936260Z * [new tag] v2.8.0-rc1 -> v2.8.0-rc1 2025-12-04T09:17:11.9937811Z * [new tag] v2.8.0-rc2 -> v2.8.0-rc2 2025-12-04T09:17:11.9939415Z * [new tag] v2.8.0-rc3 -> v2.8.0-rc3 2025-12-04T09:17:11.9940939Z * [new tag] v2.8.0-rc4 -> v2.8.0-rc4 2025-12-04T09:17:11.9942496Z * [new tag] v2.8.0-rc5 -> v2.8.0-rc5 2025-12-04T09:17:11.9943995Z * [new tag] v2.8.0-rc6 -> v2.8.0-rc6 2025-12-04T09:17:11.9945533Z * [new tag] v2.8.0-rc7 -> v2.8.0-rc7 2025-12-04T09:17:11.9946995Z * [new tag] v2.8.0-rc8 -> v2.8.0-rc8 2025-12-04T09:17:11.9948638Z * [new tag] v2.9.0 -> v2.9.0 2025-12-04T09:17:11.9950092Z * [new tag] v2.9.0-rc1 -> v2.9.0-rc1 2025-12-04T09:17:11.9951735Z * [new tag] v2.9.0-rc10 -> v2.9.0-rc10 2025-12-04T09:17:11.9953161Z * [new tag] v2.9.0-rc11 -> v2.9.0-rc11 2025-12-04T09:17:11.9954953Z * [new tag] v2.9.0-rc2 -> v2.9.0-rc2 2025-12-04T09:17:11.9956446Z * [new tag] v2.9.0-rc3 -> v2.9.0-rc3 2025-12-04T09:17:11.9957997Z * [new tag] v2.9.0-rc4 -> v2.9.0-rc4 2025-12-04T09:17:11.9959501Z * [new tag] v2.9.0-rc5 -> v2.9.0-rc5 2025-12-04T09:17:11.9961368Z * [new tag] v2.9.0-rc6 -> v2.9.0-rc6 2025-12-04T09:17:11.9962844Z * [new tag] v2.9.0-rc7 -> v2.9.0-rc7 2025-12-04T09:17:11.9964521Z * [new tag] v2.9.0-rc8 -> v2.9.0-rc8 
2025-12-04T09:17:11.9965796Z * [new tag] v2.9.0-rc9 -> v2.9.0-rc9 2025-12-04T09:17:11.9967097Z * [new tag] v2.9.1 -> v2.9.1 2025-12-04T09:17:11.9968699Z * [new tag] v2.9.1-rc1 -> v2.9.1-rc1 2025-12-04T09:17:11.9970235Z * [new tag] v2.9.1-rc2 -> v2.9.1-rc2 2025-12-04T09:17:11.9972398Z * [new tag] viable/strict/1759343184 -> viable/strict/1759343184 2025-12-04T09:17:11.9973847Z * [new tag] viable/strict/1759346540 -> viable/strict/1759346540 2025-12-04T09:17:11.9975196Z * [new tag] viable/strict/1759348181 -> viable/strict/1759348181 2025-12-04T09:17:11.9976597Z * [new tag] viable/strict/1759350324 -> viable/strict/1759350324 2025-12-04T09:17:11.9977970Z * [new tag] viable/strict/1759351793 -> viable/strict/1759351793 2025-12-04T09:17:11.9979554Z * [new tag] viable/strict/1759353844 -> viable/strict/1759353844 2025-12-04T09:17:11.9980934Z * [new tag] viable/strict/1759355374 -> viable/strict/1759355374 2025-12-04T09:17:11.9982278Z * [new tag] viable/strict/1759357472 -> viable/strict/1759357472 2025-12-04T09:17:11.9983666Z * [new tag] viable/strict/1759361002 -> viable/strict/1759361002 2025-12-04T09:17:11.9985421Z * [new tag] viable/strict/1759362585 -> viable/strict/1759362585 2025-12-04T09:17:11.9987107Z * [new tag] viable/strict/1759365359 -> viable/strict/1759365359 2025-12-04T09:17:11.9988656Z * [new tag] viable/strict/1759370089 -> viable/strict/1759370089 2025-12-04T09:17:11.9990171Z * [new tag] viable/strict/1759377554 -> viable/strict/1759377554 2025-12-04T09:17:11.9991690Z * [new tag] viable/strict/1759379133 -> viable/strict/1759379133 2025-12-04T09:17:11.9993142Z * [new tag] viable/strict/1759389871 -> viable/strict/1759389871 2025-12-04T09:17:11.9994625Z * [new tag] viable/strict/1759393562 -> viable/strict/1759393562 2025-12-04T09:17:11.9996091Z * [new tag] viable/strict/1759395076 -> viable/strict/1759395076 2025-12-04T09:17:11.9997647Z * [new tag] viable/strict/1759398579 -> viable/strict/1759398579 2025-12-04T09:17:11.9999162Z * [new tag] 
viable/strict/1759404142 -> viable/strict/1759404142 2025-12-04T09:17:12.0001014Z * [new tag] viable/strict/1759405773 -> viable/strict/1759405773 2025-12-04T09:17:12.0005064Z * [new tag] viable/strict/1759408041 -> viable/strict/1759408041 2025-12-04T09:17:12.0006518Z * [new tag] viable/strict/1759411593 -> viable/strict/1759411593 2025-12-04T09:17:12.0007967Z * [new tag] viable/strict/1759427395 -> viable/strict/1759427395 2025-12-04T09:17:12.0009506Z * [new tag] viable/strict/1759434582 -> viable/strict/1759434582 2025-12-04T09:17:12.0011012Z * [new tag] viable/strict/1759436720 -> viable/strict/1759436720 2025-12-04T09:17:12.0012747Z * [new tag] viable/strict/1759440219 -> viable/strict/1759440219 2025-12-04T09:17:12.0013865Z * [new tag] viable/strict/1759441948 -> viable/strict/1759441948 2025-12-04T09:17:12.0015408Z * [new tag] viable/strict/1759443860 -> viable/strict/1759443860 2025-12-04T09:17:12.0016905Z * [new tag] viable/strict/1759445377 -> viable/strict/1759445377 2025-12-04T09:17:12.0018487Z * [new tag] viable/strict/1759447415 -> viable/strict/1759447415 2025-12-04T09:17:12.0020430Z * [new tag] viable/strict/1759451750 -> viable/strict/1759451750 2025-12-04T09:17:12.0021929Z * [new tag] viable/strict/1759453910 -> viable/strict/1759453910 2025-12-04T09:17:12.0023418Z * [new tag] viable/strict/1759456483 -> viable/strict/1759456483 2025-12-04T09:17:12.0024977Z * [new tag] viable/strict/1759459279 -> viable/strict/1759459279 2025-12-04T09:17:12.0026424Z * [new tag] viable/strict/1759460742 -> viable/strict/1759460742 2025-12-04T09:17:12.0027902Z * [new tag] viable/strict/1759462025 -> viable/strict/1759462025 2025-12-04T09:17:12.0029544Z * [new tag] viable/strict/1759469086 -> viable/strict/1759469086 2025-12-04T09:17:12.0031092Z * [new tag] viable/strict/1759470581 -> viable/strict/1759470581 2025-12-04T09:17:12.0032559Z * [new tag] viable/strict/1759472786 -> viable/strict/1759472786 2025-12-04T09:17:12.0034025Z * [new tag] viable/strict/1759476294 
-> viable/strict/1759476294 2025-12-04T09:17:12.0035550Z * [new tag] viable/strict/1759479963 -> viable/strict/1759479963 2025-12-04T09:17:12.0037024Z * [new tag] viable/strict/1759492177 -> viable/strict/1759492177 2025-12-04T09:17:12.0038594Z * [new tag] viable/strict/1759519278 -> viable/strict/1759519278 2025-12-04T09:17:12.0040087Z * [new tag] viable/strict/1759524580 -> viable/strict/1759524580 2025-12-04T09:17:12.0041615Z * [new tag] viable/strict/1759528193 -> viable/strict/1759528193 2025-12-04T09:17:12.0043282Z * [new tag] viable/strict/1759533797 -> viable/strict/1759533797 2025-12-04T09:17:12.0044755Z * [new tag] viable/strict/1759542780 -> viable/strict/1759542780 2025-12-04T09:17:12.0046246Z * [new tag] viable/strict/1759549779 -> viable/strict/1759549779 2025-12-04T09:17:12.0047876Z * [new tag] viable/strict/1759555455 -> viable/strict/1759555455 2025-12-04T09:17:12.0049322Z * [new tag] viable/strict/1759559176 -> viable/strict/1759559176 2025-12-04T09:17:12.0050830Z * [new tag] viable/strict/1759560629 -> viable/strict/1759560629 2025-12-04T09:17:12.0052326Z * [new tag] viable/strict/1759569848 -> viable/strict/1759569848 2025-12-04T09:17:12.0054045Z * [new tag] viable/strict/1759571382 -> viable/strict/1759571382 2025-12-04T09:17:12.0055514Z * [new tag] viable/strict/1759573474 -> viable/strict/1759573474 2025-12-04T09:17:12.0057242Z * [new tag] viable/strict/1759618187 -> viable/strict/1759618187 2025-12-04T09:17:12.0058780Z * [new tag] viable/strict/1759626742 -> viable/strict/1759626742 2025-12-04T09:17:12.0060321Z * [new tag] viable/strict/1759632427 -> viable/strict/1759632427 2025-12-04T09:17:12.0061732Z * [new tag] viable/strict/1759634971 -> viable/strict/1759634971 2025-12-04T09:17:12.0063278Z * [new tag] viable/strict/1759661382 -> viable/strict/1759661382 2025-12-04T09:17:12.0064805Z * [new tag] viable/strict/1759663294 -> viable/strict/1759663294 2025-12-04T09:17:12.0066218Z * [new tag] viable/strict/1759708178 -> 
viable/strict/1759708178 2025-12-04T09:17:12.0067761Z * [new tag] viable/strict/1759715695 -> viable/strict/1759715695 2025-12-04T09:17:12.0069197Z * [new tag] viable/strict/1759728293 -> viable/strict/1759728293 2025-12-04T09:17:12.0070728Z * [new tag] viable/strict/1759735513 -> viable/strict/1759735513 2025-12-04T09:17:12.0072274Z * [new tag] viable/strict/1759739177 -> viable/strict/1759739177 2025-12-04T09:17:12.0073698Z * [new tag] viable/strict/1759758635 -> viable/strict/1759758635 2025-12-04T09:17:12.0075180Z * [new tag] viable/strict/1759765784 -> viable/strict/1759765784 2025-12-04T09:17:12.0076648Z * [new tag] viable/strict/1759767948 -> viable/strict/1759767948 2025-12-04T09:17:12.0078205Z * [new tag] viable/strict/1759771461 -> viable/strict/1759771461 2025-12-04T09:17:12.0079696Z * [new tag] viable/strict/1759776706 -> viable/strict/1759776706 2025-12-04T09:17:12.0081322Z * [new tag] viable/strict/1759782317 -> viable/strict/1759782317 2025-12-04T09:17:12.0082857Z * [new tag] viable/strict/1759783777 -> viable/strict/1759783777 2025-12-04T09:17:12.0084429Z * [new tag] viable/strict/1759785815 -> viable/strict/1759785815 2025-12-04T09:17:12.0086012Z * [new tag] viable/strict/1759789459 -> viable/strict/1759789459 2025-12-04T09:17:12.0087481Z * [new tag] viable/strict/1759790974 -> viable/strict/1759790974 2025-12-04T09:17:12.0088886Z * [new tag] viable/strict/1759794583 -> viable/strict/1759794583 2025-12-04T09:17:12.0090329Z * [new tag] viable/strict/1759797408 -> viable/strict/1759797408 2025-12-04T09:17:12.0091864Z * [new tag] viable/strict/1759799518 -> viable/strict/1759799518 2025-12-04T09:17:12.0093334Z * [new tag] viable/strict/1759804909 -> viable/strict/1759804909 2025-12-04T09:17:12.0094841Z * [new tag] viable/strict/1759807643 -> viable/strict/1759807643 2025-12-04T09:17:12.0096376Z * [new tag] viable/strict/1759809089 -> viable/strict/1759809089 2025-12-04T09:17:12.0097839Z * [new tag] viable/strict/1759811145 -> viable/strict/1759811145 
2025-12-04T09:17:12.0099383Z * [new tag] viable/strict/1759812581 -> viable/strict/1759812581 2025-12-04T09:17:12.0100922Z * [new tag] viable/strict/1759814683 -> viable/strict/1759814683 2025-12-04T09:17:12.0102665Z * [new tag] viable/strict/1759821889 -> viable/strict/1759821889 2025-12-04T09:17:12.0104171Z * [new tag] viable/strict/1759823376 -> viable/strict/1759823376 2025-12-04T09:17:12.0105665Z * [new tag] viable/strict/1759827107 -> viable/strict/1759827107 2025-12-04T09:17:12.0107127Z * [new tag] viable/strict/1759830577 -> viable/strict/1759830577 2025-12-04T09:17:12.0108725Z * [new tag] viable/strict/1759832720 -> viable/strict/1759832720 2025-12-04T09:17:12.0110171Z * [new tag] viable/strict/1759842063 -> viable/strict/1759842063 2025-12-04T09:17:12.0111648Z * [new tag] viable/strict/1759847121 -> viable/strict/1759847121 2025-12-04T09:17:12.0113424Z * [new tag] viable/strict/1759850721 -> viable/strict/1759850721 2025-12-04T09:17:12.0114953Z * [new tag] viable/strict/1759857870 -> viable/strict/1759857870 2025-12-04T09:17:12.0116474Z * [new tag] viable/strict/1759863143 -> viable/strict/1759863143 2025-12-04T09:17:12.0118479Z * [new tag] viable/strict/1759875874 -> viable/strict/1759875874 2025-12-04T09:17:12.0119892Z * [new tag] viable/strict/1759877385 -> viable/strict/1759877385 2025-12-04T09:17:12.0121432Z * [new tag] viable/strict/1759883801 -> viable/strict/1759883801 2025-12-04T09:17:12.0123061Z * [new tag] viable/strict/1759885922 -> viable/strict/1759885922 2025-12-04T09:17:12.0124402Z * [new tag] viable/strict/1759888488 -> viable/strict/1759888488 2025-12-04T09:17:12.0125876Z * [new tag] viable/strict/1759895471 -> viable/strict/1759895471 2025-12-04T09:17:12.0127384Z * [new tag] viable/strict/1759904803 -> viable/strict/1759904803 2025-12-04T09:17:12.0129255Z * [new tag] viable/strict/1759908300 -> viable/strict/1759908300 2025-12-04T09:17:12.0130802Z * [new tag] viable/strict/1759915520 -> viable/strict/1759915520 
2025-12-04T09:17:12.0132273Z * [new tag] viable/strict/1759916978 -> viable/strict/1759916978 2025-12-04T09:17:12.0133603Z * [new tag] viable/strict/1759930024 -> viable/strict/1759930024 2025-12-04T09:17:12.0135136Z * [new tag] viable/strict/1759948122 -> viable/strict/1759948122 2025-12-04T09:17:12.0136690Z * [new tag] viable/strict/1759952983 -> viable/strict/1759952983 2025-12-04T09:17:12.0138268Z * [new tag] viable/strict/1759955121 -> viable/strict/1759955121 2025-12-04T09:17:12.0139802Z * [new tag] viable/strict/1759962298 -> viable/strict/1759962298 2025-12-04T09:17:12.0141330Z * [new tag] viable/strict/1759965837 -> viable/strict/1759965837 2025-12-04T09:17:12.0142898Z * [new tag] viable/strict/1759970213 -> viable/strict/1759970213 2025-12-04T09:17:12.0144422Z * [new tag] viable/strict/1759974894 -> viable/strict/1759974894 2025-12-04T09:17:12.0145871Z * [new tag] viable/strict/1759977763 -> viable/strict/1759977763 2025-12-04T09:17:12.0147394Z * [new tag] viable/strict/1759979241 -> viable/strict/1759979241 2025-12-04T09:17:12.0148951Z * [new tag] viable/strict/1759985417 -> viable/strict/1759985417 2025-12-04T09:17:12.0150454Z * [new tag] viable/strict/1759987490 -> viable/strict/1759987490 2025-12-04T09:17:12.0152042Z * [new tag] viable/strict/1759996180 -> viable/strict/1759996180 2025-12-04T09:17:12.0153512Z * [new tag] viable/strict/1760065682 -> viable/strict/1760065682 2025-12-04T09:17:12.0155025Z * [new tag] viable/strict/1760066894 -> viable/strict/1760066894 2025-12-04T09:17:12.0156578Z * [new tag] viable/strict/1760070345 -> viable/strict/1760070345 2025-12-04T09:17:12.0158051Z * [new tag] viable/strict/1760089782 -> viable/strict/1760089782 2025-12-04T09:17:12.0159633Z * [new tag] viable/strict/1760091921 -> viable/strict/1760091921 2025-12-04T09:17:12.0161224Z * [new tag] viable/strict/1760127924 -> viable/strict/1760127924 2025-12-04T09:17:12.0162761Z * [new tag] viable/strict/1760129489 -> viable/strict/1760129489 
2025-12-04T09:17:12.0164411Z * [new tag] viable/strict/1760132980 -> viable/strict/1760132980 2025-12-04T09:17:12.0165880Z * [new tag] viable/strict/1760135060 -> viable/strict/1760135060 2025-12-04T09:17:12.0167370Z * [new tag] viable/strict/1760215782 -> viable/strict/1760215782 2025-12-04T09:17:12.0168939Z * [new tag] viable/strict/1760273849 -> viable/strict/1760273849 2025-12-04T09:17:12.0170364Z * [new tag] viable/strict/1760275517 -> viable/strict/1760275517 2025-12-04T09:17:12.0171859Z * [new tag] viable/strict/1760276979 -> viable/strict/1760276979 2025-12-04T09:17:12.0173414Z * [new tag] viable/strict/1760279007 -> viable/strict/1760279007 2025-12-04T09:17:12.0174961Z * [new tag] viable/strict/1760286328 -> viable/strict/1760286328 2025-12-04T09:17:12.0176280Z * [new tag] viable/strict/1760493304 -> viable/strict/1760493304 2025-12-04T09:17:12.0177888Z * [new tag] viable/strict/1760496298 -> viable/strict/1760496298 2025-12-04T09:17:12.0179311Z * [new tag] viable/strict/1760518396 -> viable/strict/1760518396 2025-12-04T09:17:12.0180873Z * [new tag] viable/strict/1760534864 -> viable/strict/1760534864 2025-12-04T09:17:12.0182336Z * [new tag] viable/strict/1760549062 -> viable/strict/1760549062 2025-12-04T09:17:12.0183914Z * [new tag] viable/strict/1760552799 -> viable/strict/1760552799 2025-12-04T09:17:12.0185498Z * [new tag] viable/strict/1760554355 -> viable/strict/1760554355 2025-12-04T09:17:12.0186996Z * [new tag] viable/strict/1760556275 -> viable/strict/1760556275 2025-12-04T09:17:12.0188521Z * [new tag] viable/strict/1760564979 -> viable/strict/1760564979 2025-12-04T09:17:12.0190109Z * [new tag] viable/strict/1760567049 -> viable/strict/1760567049 2025-12-04T09:17:12.0192021Z * [new tag] viable/strict/1760568585 -> viable/strict/1760568585 2025-12-04T09:17:12.0193497Z * [new tag] viable/strict/1760570630 -> viable/strict/1760570630 2025-12-04T09:17:12.0194970Z * [new tag] viable/strict/1760572180 -> viable/strict/1760572180 
2025-12-04T09:17:12.0196442Z * [new tag] viable/strict/1760575094 -> viable/strict/1760575094 2025-12-04T09:17:12.0198072Z * [new tag] viable/strict/1760579709 -> viable/strict/1760579709 2025-12-04T09:17:12.0200159Z * [new tag] viable/strict/1760582614 -> viable/strict/1760582614 2025-12-04T09:17:12.0202012Z * [new tag] viable/strict/1760586815 -> viable/strict/1760586815 2025-12-04T09:17:12.0203310Z * [new tag] viable/strict/1760588829 -> viable/strict/1760588829 2025-12-04T09:17:12.0204791Z * [new tag] viable/strict/1760590200 -> viable/strict/1760590200 2025-12-04T09:17:12.0206345Z * [new tag] viable/strict/1760592311 -> viable/strict/1760592311 2025-12-04T09:17:12.0207851Z * [new tag] viable/strict/1760619733 -> viable/strict/1760619733 2025-12-04T09:17:12.0209243Z * [new tag] viable/strict/1760628335 -> viable/strict/1760628335 2025-12-04T09:17:12.0210748Z * [new tag] viable/strict/1760635490 -> viable/strict/1760635490 2025-12-04T09:17:12.0212195Z * [new tag] viable/strict/1760640743 -> viable/strict/1760640743 2025-12-04T09:17:12.0213677Z * [new tag] viable/strict/1760642528 -> viable/strict/1760642528 2025-12-04T09:17:12.0215230Z * [new tag] viable/strict/1760646330 -> viable/strict/1760646330 2025-12-04T09:17:12.0217125Z * [new tag] viable/strict/1760666101 -> viable/strict/1760666101 2025-12-04T09:17:12.0218690Z * [new tag] viable/strict/1760668990 -> viable/strict/1760668990 2025-12-04T09:17:12.0220191Z * [new tag] viable/strict/1760670600 -> viable/strict/1760670600 2025-12-04T09:17:12.0221679Z * [new tag] viable/strict/1760671704 -> viable/strict/1760671704 2025-12-04T09:17:12.0223200Z * [new tag] viable/strict/1760673121 -> viable/strict/1760673121 2025-12-04T09:17:12.0224635Z * [new tag] viable/strict/1760675352 -> viable/strict/1760675352 2025-12-04T09:17:12.0226178Z * [new tag] viable/strict/1760696731 -> viable/strict/1760696731 2025-12-04T09:17:12.0229145Z * [new tag] viable/strict/1760723515 -> viable/strict/1760723515 
2025-12-04T09:17:12.0230664Z * [new tag] viable/strict/1760727234 -> viable/strict/1760727234 2025-12-04T09:17:12.0232154Z * [new tag] viable/strict/1760730578 -> viable/strict/1760730578 2025-12-04T09:17:12.0233649Z * [new tag] viable/strict/1760732726 -> viable/strict/1760732726 2025-12-04T09:17:12.0235322Z * [new tag] viable/strict/1760734180 -> viable/strict/1760734180 2025-12-04T09:17:12.0236749Z * [new tag] viable/strict/1760736251 -> viable/strict/1760736251 2025-12-04T09:17:12.0238216Z * [new tag] viable/strict/1760737772 -> viable/strict/1760737772 2025-12-04T09:17:12.0239839Z * [new tag] viable/strict/1760758005 -> viable/strict/1760758005 2025-12-04T09:17:12.0241432Z * [new tag] viable/strict/1760761532 -> viable/strict/1760761532 2025-12-04T09:17:12.0243027Z * [new tag] viable/strict/1760802581 -> viable/strict/1760802581 2025-12-04T09:17:12.0244494Z * [new tag] viable/strict/1760827772 -> viable/strict/1760827772 2025-12-04T09:17:12.0245973Z * [new tag] viable/strict/1760834524 -> viable/strict/1760834524 2025-12-04T09:17:12.0247509Z * [new tag] viable/strict/1760845009 -> viable/strict/1760845009 2025-12-04T09:17:12.0249086Z * [new tag] viable/strict/1760876836 -> viable/strict/1760876836 2025-12-04T09:17:12.0250574Z * [new tag] viable/strict/1760880329 -> viable/strict/1760880329 2025-12-04T09:17:12.0252041Z * [new tag] viable/strict/1760888987 -> viable/strict/1760888987 2025-12-04T09:17:12.0253521Z * [new tag] viable/strict/1760912664 -> viable/strict/1760912664 2025-12-04T09:17:12.0255040Z * [new tag] viable/strict/1760925321 -> viable/strict/1760925321 2025-12-04T09:17:12.0256628Z * [new tag] viable/strict/1760931488 -> viable/strict/1760931488 2025-12-04T09:17:12.0258135Z * [new tag] viable/strict/1760932693 -> viable/strict/1760932693 2025-12-04T09:17:12.0259640Z * [new tag] viable/strict/1761004184 -> viable/strict/1761004184 2025-12-04T09:17:12.0261090Z * [new tag] viable/strict/1761014748 -> viable/strict/1761014748 
2025-12-04T09:17:12.0262592Z * [new tag] viable/strict/1761017491 -> viable/strict/1761017491 2025-12-04T09:17:12.0264085Z * [new tag] viable/strict/1761018806 -> viable/strict/1761018806 2025-12-04T09:17:12.0265662Z * [new tag] viable/strict/1761020754 -> viable/strict/1761020754 2025-12-04T09:17:12.0267335Z * [new tag] viable/strict/1761024303 -> viable/strict/1761024303 2025-12-04T09:17:12.0268751Z * [new tag] viable/strict/1761029582 -> viable/strict/1761029582 2025-12-04T09:17:12.0270164Z * [new tag] viable/strict/1761031535 -> viable/strict/1761031535 2025-12-04T09:17:12.0271618Z * [new tag] viable/strict/1761035196 -> viable/strict/1761035196 2025-12-04T09:17:12.0273217Z * [new tag] viable/strict/1761045825 -> viable/strict/1761045825 2025-12-04T09:17:12.0274843Z * [new tag] viable/strict/1761054796 -> viable/strict/1761054796 2025-12-04T09:17:12.0276355Z * [new tag] viable/strict/1761060314 -> viable/strict/1761060314 2025-12-04T09:17:12.0278018Z * [new tag] viable/strict/1761071198 -> viable/strict/1761071198 2025-12-04T09:17:12.0279682Z * [new tag] viable/strict/1761074628 -> viable/strict/1761074628 2025-12-04T09:17:12.0281293Z * [new tag] viable/strict/1761078351 -> viable/strict/1761078351 2025-12-04T09:17:12.0282793Z * [new tag] viable/strict/1761079822 -> viable/strict/1761079822 2025-12-04T09:17:12.0284276Z * [new tag] viable/strict/1761081873 -> viable/strict/1761081873 2025-12-04T09:17:12.0285815Z * [new tag] viable/strict/1761083392 -> viable/strict/1761083392 2025-12-04T09:17:12.0287344Z * [new tag] viable/strict/1761085465 -> viable/strict/1761085465 2025-12-04T09:17:12.0288886Z * [new tag] viable/strict/1761089099 -> viable/strict/1761089099 2025-12-04T09:17:12.0290443Z * [new tag] viable/strict/1761095535 -> viable/strict/1761095535 2025-12-04T09:17:12.0291865Z * [new tag] viable/strict/1761098119 -> viable/strict/1761098119 2025-12-04T09:17:12.0293760Z * [new tag] viable/strict/1761101330 -> viable/strict/1761101330 
2025-12-04T09:17:12.0295270Z * [new tag] viable/strict/1761114425 -> viable/strict/1761114425 2025-12-04T09:17:12.0296871Z * [new tag] viable/strict/1761116036 -> viable/strict/1761116036 2025-12-04T09:17:12.0298401Z * [new tag] viable/strict/1761119379 -> viable/strict/1761119379 2025-12-04T09:17:12.0299882Z * [new tag] viable/strict/1761121601 -> viable/strict/1761121601 2025-12-04T09:17:12.0301581Z * [new tag] viable/strict/1761123234 -> viable/strict/1761123234 2025-12-04T09:17:12.0303010Z * [new tag] viable/strict/1761126621 -> viable/strict/1761126621 2025-12-04T09:17:12.0304480Z * [new tag] viable/strict/1761132259 -> viable/strict/1761132259 2025-12-04T09:17:12.0306047Z * [new tag] viable/strict/1761146746 -> viable/strict/1761146746 2025-12-04T09:17:12.0307627Z * [new tag] viable/strict/1761164752 -> viable/strict/1761164752 2025-12-04T09:17:12.0309187Z * [new tag] viable/strict/1761166198 -> viable/strict/1761166198 2025-12-04T09:17:12.0310692Z * [new tag] viable/strict/1761175424 -> viable/strict/1761175424 2025-12-04T09:17:12.0312204Z * [new tag] viable/strict/1761176983 -> viable/strict/1761176983 2025-12-04T09:17:12.0313868Z * [new tag] viable/strict/1761179891 -> viable/strict/1761179891 2025-12-04T09:17:12.0315460Z * [new tag] viable/strict/1761181930 -> viable/strict/1761181930 2025-12-04T09:17:12.0317359Z * [new tag] viable/strict/1761184516 -> viable/strict/1761184516 2025-12-04T09:17:12.0318988Z * [new tag] viable/strict/1761190179 -> viable/strict/1761190179 2025-12-04T09:17:12.0320554Z * [new tag] viable/strict/1761193558 -> viable/strict/1761193558 2025-12-04T09:17:12.0322137Z * [new tag] viable/strict/1761207990 -> viable/strict/1761207990 2025-12-04T09:17:12.0323657Z * [new tag] viable/strict/1761229539 -> viable/strict/1761229539 2025-12-04T09:17:12.0325336Z * [new tag] viable/strict/1761244031 -> viable/strict/1761244031 2025-12-04T09:17:12.0326850Z * [new tag] viable/strict/1761248986 -> viable/strict/1761248986 
2025-12-04T09:17:12.0328538Z * [new tag] viable/strict/1761259791 -> viable/strict/1761259791 2025-12-04T09:17:12.0329966Z * [new tag] viable/strict/1761266139 -> viable/strict/1761266139 2025-12-04T09:17:12.0331486Z * [new tag] viable/strict/1761268316 -> viable/strict/1761268316 2025-12-04T09:17:12.0332994Z * [new tag] viable/strict/1761273805 -> viable/strict/1761273805 2025-12-04T09:17:12.0334519Z * [new tag] viable/strict/1761275261 -> viable/strict/1761275261 2025-12-04T09:17:12.0336076Z * [new tag] viable/strict/1761277913 -> viable/strict/1761277913 2025-12-04T09:17:12.0337619Z * [new tag] viable/strict/1761290701 -> viable/strict/1761290701 2025-12-04T09:17:12.0339165Z * [new tag] viable/strict/1761294396 -> viable/strict/1761294396 2025-12-04T09:17:12.0340720Z * [new tag] viable/strict/1761303047 -> viable/strict/1761303047 2025-12-04T09:17:12.0342266Z * [new tag] viable/strict/1761335388 -> viable/strict/1761335388 2025-12-04T09:17:12.0343753Z * [new tag] viable/strict/1761337551 -> viable/strict/1761337551 2025-12-04T09:17:12.0345410Z * [new tag] viable/strict/1761339007 -> viable/strict/1761339007 2025-12-04T09:17:12.0346767Z * [new tag] viable/strict/1761341050 -> viable/strict/1761341050 2025-12-04T09:17:12.0348286Z * [new tag] viable/strict/1761346188 -> viable/strict/1761346188 2025-12-04T09:17:12.0349979Z * [new tag] viable/strict/1761349792 -> viable/strict/1761349792 2025-12-04T09:17:12.0351497Z * [new tag] viable/strict/1761352620 -> viable/strict/1761352620 2025-12-04T09:17:12.0352968Z * [new tag] viable/strict/1761354730 -> viable/strict/1761354730 2025-12-04T09:17:12.0354514Z * [new tag] viable/strict/1761357298 -> viable/strict/1761357298 2025-12-04T09:17:12.0355993Z * [new tag] viable/strict/1761360201 -> viable/strict/1761360201 2025-12-04T09:17:12.0357539Z * [new tag] viable/strict/1761361753 -> viable/strict/1761361753 2025-12-04T09:17:12.0359086Z * [new tag] viable/strict/1761364351 -> viable/strict/1761364351 
2025-12-04T09:17:12.0360664Z * [new tag] viable/strict/1761366338 -> viable/strict/1761366338 2025-12-04T09:17:12.0362284Z * [new tag] viable/strict/1761367802 -> viable/strict/1761367802 2025-12-04T09:17:12.0363810Z * [new tag] viable/strict/1761369889 -> viable/strict/1761369889 2025-12-04T09:17:12.0365396Z * [new tag] viable/strict/1761371385 -> viable/strict/1761371385 2025-12-04T09:17:12.0366996Z * [new tag] viable/strict/1761373581 -> viable/strict/1761373581 2025-12-04T09:17:12.0368745Z * [new tag] viable/strict/1761375054 -> viable/strict/1761375054 2025-12-04T09:17:12.0370271Z * [new tag] viable/strict/1761421785 -> viable/strict/1761421785 2025-12-04T09:17:12.0371890Z * [new tag] viable/strict/1761434614 -> viable/strict/1761434614 2025-12-04T09:17:12.0373785Z * [new tag] viable/strict/1761439254 -> viable/strict/1761439254 2025-12-04T09:17:12.0375346Z * [new tag] viable/strict/1761454187 -> viable/strict/1761454187 2025-12-04T09:17:12.0376958Z * [new tag] viable/strict/1761459991 -> viable/strict/1761459991 2025-12-04T09:17:12.0378850Z * [new tag] viable/strict/1761470668 -> viable/strict/1761470668 2025-12-04T09:17:12.0380704Z * [new tag] viable/strict/1761472188 -> viable/strict/1761472188 2025-12-04T09:17:12.0382465Z * [new tag] viable/strict/1761503178 -> viable/strict/1761503178 2025-12-04T09:17:12.0383792Z * [new tag] viable/strict/1761517492 -> viable/strict/1761517492 2025-12-04T09:17:12.0385349Z * [new tag] viable/strict/1761518981 -> viable/strict/1761518981 2025-12-04T09:17:12.0386921Z * [new tag] viable/strict/1761533609 -> viable/strict/1761533609 2025-12-04T09:17:12.0388308Z * [new tag] viable/strict/1761546438 -> viable/strict/1761546438 2025-12-04T09:17:12.0389941Z * [new tag] viable/strict/1761548133 -> viable/strict/1761548133 2025-12-04T09:17:12.0391706Z * [new tag] viable/strict/1761555186 -> viable/strict/1761555186 2025-12-04T09:17:12.0393337Z * [new tag] viable/strict/1761557178 -> viable/strict/1761557178 
2025-12-04T09:17:12.0394852Z * [new tag] viable/strict/1761560772 -> viable/strict/1761560772 2025-12-04T09:17:12.0396393Z * [new tag] viable/strict/1761562266 -> viable/strict/1761562266 2025-12-04T09:17:12.0397954Z * [new tag] viable/strict/1761564260 -> viable/strict/1761564260 2025-12-04T09:17:12.0399554Z * [new tag] viable/strict/1761568072 -> viable/strict/1761568072 2025-12-04T09:17:12.0401273Z * [new tag] viable/strict/1761571683 -> viable/strict/1761571683 2025-12-04T09:17:12.0404229Z * [new tag] viable/strict/1761580199 -> viable/strict/1761580199 2025-12-04T09:17:12.0405709Z * [new tag] viable/strict/1761587383 -> viable/strict/1761587383 2025-12-04T09:17:12.0407271Z * [new tag] viable/strict/1761591165 -> viable/strict/1761591165 2025-12-04T09:17:12.0408873Z * [new tag] viable/strict/1761594575 -> viable/strict/1761594575 2025-12-04T09:17:12.0410429Z * [new tag] viable/strict/1761596710 -> viable/strict/1761596710 2025-12-04T09:17:12.0411905Z * [new tag] viable/strict/1761598189 -> viable/strict/1761598189 2025-12-04T09:17:12.0413410Z * [new tag] viable/strict/1761600254 -> viable/strict/1761600254 2025-12-04T09:17:12.0414948Z * [new tag] viable/strict/1761603879 -> viable/strict/1761603879 2025-12-04T09:17:12.0416585Z * [new tag] viable/strict/1761605429 -> viable/strict/1761605429 2025-12-04T09:17:12.0418153Z * [new tag] viable/strict/1761607468 -> viable/strict/1761607468 2025-12-04T09:17:12.0420154Z * [new tag] viable/strict/1761608983 -> viable/strict/1761608983 2025-12-04T09:17:12.0421750Z * [new tag] viable/strict/1761611846 -> viable/strict/1761611846 2025-12-04T09:17:12.0423443Z * [new tag] viable/strict/1761613922 -> viable/strict/1761613922 2025-12-04T09:17:12.0424804Z * [new tag] viable/strict/1761616504 -> viable/strict/1761616504 2025-12-04T09:17:12.0426116Z * [new tag] viable/strict/1761619599 -> viable/strict/1761619599 2025-12-04T09:17:12.0427693Z * [new tag] viable/strict/1761686693 -> viable/strict/1761686693 
2025-12-04T09:17:12.0429276Z * [new tag] viable/strict/1761688179 -> viable/strict/1761688179 2025-12-04T09:17:12.0430895Z * [new tag] viable/strict/1761691973 -> viable/strict/1761691973 2025-12-04T09:17:12.0432595Z * [new tag] viable/strict/1761693884 -> viable/strict/1761693884 2025-12-04T09:17:12.0434164Z * [new tag] viable/strict/1761695389 -> viable/strict/1761695389 2025-12-04T09:17:12.0435726Z * [new tag] viable/strict/1761698408 -> viable/strict/1761698408 2025-12-04T09:17:12.0437211Z * [new tag] viable/strict/1761702931 -> viable/strict/1761702931 2025-12-04T09:17:12.0438762Z * [new tag] viable/strict/1761706307 -> viable/strict/1761706307 2025-12-04T09:17:12.0440503Z * [new tag] viable/strict/1761709065 -> viable/strict/1761709065 2025-12-04T09:17:12.0442209Z * [new tag] viable/strict/1761710285 -> viable/strict/1761710285 2025-12-04T09:17:12.0443821Z * [new tag] viable/strict/1761711983 -> viable/strict/1761711983 2025-12-04T09:17:12.0445422Z * [new tag] viable/strict/1761713514 -> viable/strict/1761713514 2025-12-04T09:17:12.0447160Z * [new tag] viable/strict/1761715523 -> viable/strict/1761715523 2025-12-04T09:17:12.0448828Z * [new tag] viable/strict/1761727973 -> viable/strict/1761727973 2025-12-04T09:17:12.0450402Z * [new tag] viable/strict/1761751558 -> viable/strict/1761751558 2025-12-04T09:17:12.0451999Z * [new tag] viable/strict/1761755187 -> viable/strict/1761755187 2025-12-04T09:17:12.0453614Z * [new tag] viable/strict/1761756826 -> viable/strict/1761756826 2025-12-04T09:17:12.0455281Z * [new tag] viable/strict/1761769551 -> viable/strict/1761769551 2025-12-04T09:17:12.0456969Z * [new tag] viable/strict/1761771032 -> viable/strict/1761771032 2025-12-04T09:17:12.0458502Z * [new tag] viable/strict/1761773101 -> viable/strict/1761773101 2025-12-04T09:17:12.0460069Z * [new tag] viable/strict/1761781792 -> viable/strict/1761781792 2025-12-04T09:17:12.0461837Z * [new tag] viable/strict/1761784788 -> viable/strict/1761784788 
2025-12-04T09:17:12.0463324Z * [new tag] viable/strict/1761786740 -> viable/strict/1761786740 2025-12-04T09:17:12.0464912Z * [new tag] viable/strict/1761789332 -> viable/strict/1761789332 2025-12-04T09:17:12.0466971Z * [new tag] viable/strict/1761792569 -> viable/strict/1761792569 2025-12-04T09:17:12.0468578Z * [new tag] viable/strict/1761795289 -> viable/strict/1761795289 2025-12-04T09:17:12.0470162Z * [new tag] viable/strict/1761798345 -> viable/strict/1761798345 2025-12-04T09:17:12.0471745Z * [new tag] viable/strict/1761799827 -> viable/strict/1761799827 2025-12-04T09:17:12.0473394Z * [new tag] viable/strict/1761805604 -> viable/strict/1761805604 2025-12-04T09:17:12.0474985Z * [new tag] viable/strict/1761807202 -> viable/strict/1761807202 2025-12-04T09:17:12.0476586Z * [new tag] viable/strict/1761809094 -> viable/strict/1761809094 2025-12-04T09:17:12.0478331Z * [new tag] viable/strict/1761810576 -> viable/strict/1761810576 2025-12-04T09:17:12.0480359Z * [new tag] viable/strict/1761812771 -> viable/strict/1761812771 2025-12-04T09:17:12.0482100Z * [new tag] viable/strict/1761814363 -> viable/strict/1761814363 2025-12-04T09:17:12.0483780Z * [new tag] viable/strict/1761857410 -> viable/strict/1761857410 2025-12-04T09:17:12.0485373Z * [new tag] viable/strict/1761860985 -> viable/strict/1761860985 2025-12-04T09:17:12.0486998Z * [new tag] viable/strict/1761863094 -> viable/strict/1761863094 2025-12-04T09:17:12.0488754Z * [new tag] viable/strict/1761864590 -> viable/strict/1761864590 2025-12-04T09:17:12.0490228Z * [new tag] viable/strict/1761866675 -> viable/strict/1761866675 2025-12-04T09:17:12.0492003Z * [new tag] viable/strict/1761868178 -> viable/strict/1761868178 2025-12-04T09:17:12.0493645Z * [new tag] viable/strict/1761871111 -> viable/strict/1761871111 2025-12-04T09:17:12.0495240Z * [new tag] viable/strict/1761873126 -> viable/strict/1761873126 2025-12-04T09:17:12.0496885Z * [new tag] viable/strict/1761875714 -> viable/strict/1761875714 
2025-12-04T09:17:12.0498556Z * [new tag] viable/strict/1761878924 -> viable/strict/1761878924 2025-12-04T09:17:12.0500195Z * [new tag] viable/strict/1761881727 -> viable/strict/1761881727 2025-12-04T09:17:12.0502049Z * [new tag] viable/strict/1761882959 -> viable/strict/1761882959 2025-12-04T09:17:12.0503589Z * [new tag] viable/strict/1761886268 -> viable/strict/1761886268 2025-12-04T09:17:12.0505128Z * [new tag] viable/strict/1761893641 -> viable/strict/1761893641 2025-12-04T09:17:12.0506770Z * [new tag] viable/strict/1761931517 -> viable/strict/1761931517 2025-12-04T09:17:12.0508453Z * [new tag] viable/strict/1761933080 -> viable/strict/1761933080 2025-12-04T09:17:12.0510778Z * [new tag] viable/strict/1761935217 -> viable/strict/1761935217 2025-12-04T09:17:12.0512229Z * [new tag] viable/strict/1761938533 -> viable/strict/1761938533 2025-12-04T09:17:12.0513840Z * [new tag] viable/strict/1761940184 -> viable/strict/1761940184 2025-12-04T09:17:12.0515426Z * [new tag] viable/strict/1761942338 -> viable/strict/1761942338 2025-12-04T09:17:12.0516993Z * [new tag] viable/strict/1761946100 -> viable/strict/1761946100 2025-12-04T09:17:12.0518851Z * [new tag] viable/strict/1761947374 -> viable/strict/1761947374 2025-12-04T09:17:12.0520835Z * [new tag] viable/strict/1761950978 -> viable/strict/1761950978 2025-12-04T09:17:12.0522546Z * [new tag] viable/strict/1761957727 -> viable/strict/1761957727 2025-12-04T09:17:12.0524004Z * [new tag] viable/strict/1761959532 -> viable/strict/1761959532 2025-12-04T09:17:12.0526107Z * [new tag] viable/strict/1761965366 -> viable/strict/1761965366 2025-12-04T09:17:12.0527946Z * [new tag] viable/strict/1761968066 -> viable/strict/1761968066 2025-12-04T09:17:12.0529492Z * [new tag] viable/strict/1761969322 -> viable/strict/1761969322 2025-12-04T09:17:12.0531092Z * [new tag] viable/strict/1761974723 -> viable/strict/1761974723 2025-12-04T09:17:12.0532667Z * [new tag] viable/strict/1761981837 -> viable/strict/1761981837 
2025-12-04T09:17:12.0534393Z * [new tag] viable/strict/1761985546 -> viable/strict/1761985546 2025-12-04T09:17:12.0536299Z * [new tag] viable/strict/1761987030 -> viable/strict/1761987030 2025-12-04T09:17:12.0538056Z * [new tag] viable/strict/1762003554 -> viable/strict/1762003554 2025-12-04T09:17:12.0539697Z * [new tag] viable/strict/1762021560 -> viable/strict/1762021560 2025-12-04T09:17:12.0541300Z * [new tag] viable/strict/1762032190 -> viable/strict/1762032190 2025-12-04T09:17:12.0542957Z * [new tag] viable/strict/1762040981 -> viable/strict/1762040981 2025-12-04T09:17:12.0544614Z * [new tag] viable/strict/1762048525 -> viable/strict/1762048525 2025-12-04T09:17:12.0546236Z * [new tag] viable/strict/1762104223 -> viable/strict/1762104223 2025-12-04T09:17:12.0547858Z * [new tag] viable/strict/1762105778 -> viable/strict/1762105778 2025-12-04T09:17:12.0549456Z * [new tag] viable/strict/1762115109 -> viable/strict/1762115109 2025-12-04T09:17:12.0551057Z * [new tag] viable/strict/1762125840 -> viable/strict/1762125840 2025-12-04T09:17:12.0552534Z * [new tag] viable/strict/1762127377 -> viable/strict/1762127377 2025-12-04T09:17:12.0554502Z * [new tag] viable/strict/1762134925 -> viable/strict/1762134925 2025-12-04T09:17:12.0556003Z * [new tag] viable/strict/1762138338 -> viable/strict/1762138338 2025-12-04T09:17:12.0557630Z * [new tag] viable/strict/1762148993 -> viable/strict/1762148993 2025-12-04T09:17:12.0559276Z * [new tag] viable/strict/1762152871 -> viable/strict/1762152871 2025-12-04T09:17:12.0561021Z * [new tag] viable/strict/1762156183 -> viable/strict/1762156183 2025-12-04T09:17:12.0562630Z * [new tag] viable/strict/1762163457 -> viable/strict/1762163457 2025-12-04T09:17:12.0564210Z * [new tag] viable/strict/1762165569 -> viable/strict/1762165569 2025-12-04T09:17:12.0565827Z * [new tag] viable/strict/1762169035 -> viable/strict/1762169035 2025-12-04T09:17:12.0567448Z * [new tag] viable/strict/1762174936 -> viable/strict/1762174936 
2025-12-04T09:17:12.0569022Z * [new tag] viable/strict/1762194412 -> viable/strict/1762194412 2025-12-04T09:17:12.0570585Z * [new tag] viable/strict/1762195876 -> viable/strict/1762195876 2025-12-04T09:17:12.0572158Z * [new tag] viable/strict/1762197788 -> viable/strict/1762197788 2025-12-04T09:17:12.0573802Z * [new tag] viable/strict/1762199389 -> viable/strict/1762199389 2025-12-04T09:17:12.0575636Z * [new tag] viable/strict/1762206585 -> viable/strict/1762206585 2025-12-04T09:17:12.0577349Z * [new tag] viable/strict/1762210184 -> viable/strict/1762210184 2025-12-04T09:17:12.0579017Z * [new tag] viable/strict/1762218736 -> viable/strict/1762218736 2025-12-04T09:17:12.0580616Z * [new tag] viable/strict/1762224529 -> viable/strict/1762224529 2025-12-04T09:17:12.0582354Z * [new tag] viable/strict/1762227253 -> viable/strict/1762227253 2025-12-04T09:17:12.0583706Z * [new tag] viable/strict/1762228515 -> viable/strict/1762228515 2025-12-04T09:17:12.0585387Z * [new tag] viable/strict/1762230349 -> viable/strict/1762230349 2025-12-04T09:17:12.0587072Z * [new tag] viable/strict/1762231859 -> viable/strict/1762231859 2025-12-04T09:17:12.0588706Z * [new tag] viable/strict/1762233925 -> viable/strict/1762233925 2025-12-04T09:17:12.0590521Z * [new tag] viable/strict/1762237630 -> viable/strict/1762237630 2025-12-04T09:17:12.0592065Z * [new tag] viable/strict/1762253522 -> viable/strict/1762253522 2025-12-04T09:17:12.0593728Z * [new tag] viable/strict/1762278588 -> viable/strict/1762278588 2025-12-04T09:17:12.0595334Z * [new tag] viable/strict/1762284203 -> viable/strict/1762284203 2025-12-04T09:17:12.0597004Z * [new tag] viable/strict/1762289446 -> viable/strict/1762289446 2025-12-04T09:17:12.0598614Z * [new tag] viable/strict/1762291515 -> viable/strict/1762291515 2025-12-04T09:17:12.0600583Z * [new tag] viable/strict/1762295100 -> viable/strict/1762295100 2025-12-04T09:17:12.0604326Z * [new tag] viable/strict/1762296590 -> viable/strict/1762296590 
2025-12-04T09:17:12.0605757Z * [new tag] viable/strict/1762300179 -> viable/strict/1762300179 2025-12-04T09:17:12.0607232Z * [new tag] viable/strict/1762303207 -> viable/strict/1762303207 2025-12-04T09:17:12.0608835Z * [new tag] viable/strict/1762386584 -> viable/strict/1762386584 2025-12-04T09:17:12.0610462Z * [new tag] viable/strict/1762391537 -> viable/strict/1762391537 2025-12-04T09:17:12.0611890Z * [new tag] viable/strict/1762394119 -> viable/strict/1762394119 2025-12-04T09:17:12.0613855Z * [new tag] viable/strict/1762397437 -> viable/strict/1762397437 2025-12-04T09:17:12.0615457Z * [new tag] viable/strict/1762400256 -> viable/strict/1762400256 2025-12-04T09:17:12.0617016Z * [new tag] viable/strict/1762401469 -> viable/strict/1762401469 2025-12-04T09:17:12.0618742Z * [new tag] viable/strict/1762408195 -> viable/strict/1762408195 2025-12-04T09:17:12.0620447Z * [new tag] viable/strict/1762410411 -> viable/strict/1762410411 2025-12-04T09:17:12.0621989Z * [new tag] viable/strict/1762417613 -> viable/strict/1762417613 2025-12-04T09:17:12.0623637Z * [new tag] viable/strict/1762419198 -> viable/strict/1762419198 2025-12-04T09:17:12.0625260Z * [new tag] viable/strict/1762422656 -> viable/strict/1762422656 2025-12-04T09:17:12.0627209Z * [new tag] viable/strict/1762424746 -> viable/strict/1762424746 2025-12-04T09:17:12.0628866Z * [new tag] viable/strict/1762446386 -> viable/strict/1762446386 2025-12-04T09:17:12.0630519Z * [new tag] viable/strict/1762449912 -> viable/strict/1762449912 2025-12-04T09:17:12.0632114Z * [new tag] viable/strict/1762457031 -> viable/strict/1762457031 2025-12-04T09:17:12.0634206Z * [new tag] viable/strict/1762462441 -> viable/strict/1762462441 2025-12-04T09:17:12.0635795Z * [new tag] viable/strict/1762467909 -> viable/strict/1762467909 2025-12-04T09:17:12.0637459Z * [new tag] viable/strict/1762471493 -> viable/strict/1762471493 2025-12-04T09:17:12.0639200Z * [new tag] viable/strict/1762475990 -> viable/strict/1762475990 
2025-12-04T09:17:12.0641026Z * [new tag] viable/strict/1762477933 -> viable/strict/1762477933 2025-12-04T09:17:12.0642645Z * [new tag] viable/strict/1762491053 -> viable/strict/1762491053 2025-12-04T09:17:12.0644460Z * [new tag] viable/strict/1762493118 -> viable/strict/1762493118 2025-12-04T09:17:12.0646115Z * [new tag] viable/strict/1762498442 -> viable/strict/1762498442 2025-12-04T09:17:12.0647632Z * [new tag] viable/strict/1762501778 -> viable/strict/1762501778 2025-12-04T09:17:12.0649453Z * [new tag] viable/strict/1762504001 -> viable/strict/1762504001 2025-12-04T09:17:12.0650986Z * [new tag] viable/strict/1762505583 -> viable/strict/1762505583 2025-12-04T09:17:12.0652650Z * [new tag] viable/strict/1762507523 -> viable/strict/1762507523 2025-12-04T09:17:12.0654361Z * [new tag] viable/strict/1762511140 -> viable/strict/1762511140 2025-12-04T09:17:12.0656054Z * [new tag] viable/strict/1762512632 -> viable/strict/1762512632 2025-12-04T09:17:12.0657769Z * [new tag] viable/strict/1762520467 -> viable/strict/1762520467 2025-12-04T09:17:12.0659383Z * [new tag] viable/strict/1762522016 -> viable/strict/1762522016 2025-12-04T09:17:12.0660982Z * [new tag] viable/strict/1762530591 -> viable/strict/1762530591 2025-12-04T09:17:12.0662580Z * [new tag] viable/strict/1762543405 -> viable/strict/1762543405 2025-12-04T09:17:12.0664059Z * [new tag] viable/strict/1762544998 -> viable/strict/1762544998 2025-12-04T09:17:12.0665607Z * [new tag] viable/strict/1762552182 -> viable/strict/1762552182 2025-12-04T09:17:12.0667188Z * [new tag] viable/strict/1762554297 -> viable/strict/1762554297 2025-12-04T09:17:12.0668822Z * [new tag] viable/strict/1762559381 -> viable/strict/1762559381 2025-12-04T09:17:12.0670337Z * [new tag] viable/strict/1762562222 -> viable/strict/1762562222 2025-12-04T09:17:12.0671981Z * [new tag] viable/strict/1762564319 -> viable/strict/1762564319 2025-12-04T09:17:12.0673501Z * [new tag] viable/strict/1762566904 -> viable/strict/1762566904 
2025-12-04T09:17:12.0676112Z * [new tag] viable/strict/1762569781 -> viable/strict/1762569781 2025-12-04T09:17:12.0676924Z * [new tag] viable/strict/1762575940 -> viable/strict/1762575940 2025-12-04T09:17:12.0678161Z * [new tag] viable/strict/1762580974 -> viable/strict/1762580974 2025-12-04T09:17:12.0680119Z * [new tag] viable/strict/1762583185 -> viable/strict/1762583185 2025-12-04T09:17:12.0687420Z * [new tag] viable/strict/1762586647 -> viable/strict/1762586647 2025-12-04T09:17:12.0687716Z * [new tag] viable/strict/1762588183 -> viable/strict/1762588183 2025-12-04T09:17:12.0688001Z * [new tag] viable/strict/1762593886 -> viable/strict/1762593886 2025-12-04T09:17:12.0688260Z * [new tag] viable/strict/1762650743 -> viable/strict/1762650743 2025-12-04T09:17:12.0688452Z * [new tag] viable/strict/1762653328 -> viable/strict/1762653328 2025-12-04T09:17:12.0689920Z * [new tag] viable/strict/1762659342 -> viable/strict/1762659342 2025-12-04T09:17:12.0691337Z * [new tag] viable/strict/1762662360 -> viable/strict/1762662360 2025-12-04T09:17:12.0692993Z * [new tag] viable/strict/1762667377 -> viable/strict/1762667377 2025-12-04T09:17:12.0694555Z * [new tag] viable/strict/1762671090 -> viable/strict/1762671090 2025-12-04T09:17:12.0696167Z * [new tag] viable/strict/1762680284 -> viable/strict/1762680284 2025-12-04T09:17:12.0697876Z * [new tag] viable/strict/1762683900 -> viable/strict/1762683900 2025-12-04T09:17:12.0699529Z * [new tag] viable/strict/1762705541 -> viable/strict/1762705541 2025-12-04T09:17:12.0701219Z * [new tag] viable/strict/1762709004 -> viable/strict/1762709004 2025-12-04T09:17:12.0703179Z * [new tag] viable/strict/1762746004 -> viable/strict/1762746004 2025-12-04T09:17:12.0704766Z * [new tag] viable/strict/1762748799 -> viable/strict/1762748799 2025-12-04T09:17:12.0706382Z * [new tag] viable/strict/1762759504 -> viable/strict/1762759504 2025-12-04T09:17:12.0708093Z * [new tag] viable/strict/1762760973 -> viable/strict/1762760973 
2025-12-04T09:17:12.0709696Z * [new tag] viable/strict/1762775374 -> viable/strict/1762775374 2025-12-04T09:17:12.0711312Z * [new tag] viable/strict/1762777661 -> viable/strict/1762777661 2025-12-04T09:17:12.0712926Z * [new tag] viable/strict/1762779774 -> viable/strict/1762779774 2025-12-04T09:17:12.0714648Z * [new tag] viable/strict/1762781259 -> viable/strict/1762781259 2025-12-04T09:17:12.0716346Z * [new tag] viable/strict/1762793628 -> viable/strict/1762793628 2025-12-04T09:17:12.0717974Z * [new tag] viable/strict/1762800711 -> viable/strict/1762800711 2025-12-04T09:17:12.0719710Z * [new tag] viable/strict/1762809894 -> viable/strict/1762809894 2025-12-04T09:17:12.0721359Z * [new tag] viable/strict/1762811384 -> viable/strict/1762811384 2025-12-04T09:17:12.0723081Z * [new tag] viable/strict/1762813841 -> viable/strict/1762813841 2025-12-04T09:17:12.0724686Z * [new tag] viable/strict/1762815047 -> viable/strict/1762815047 2025-12-04T09:17:12.0726437Z * [new tag] viable/strict/1762817094 -> viable/strict/1762817094 2025-12-04T09:17:12.0728059Z * [new tag] viable/strict/1762818582 -> viable/strict/1762818582 2025-12-04T09:17:12.0729741Z * [new tag] viable/strict/1762821623 -> viable/strict/1762821623 2025-12-04T09:17:12.0731227Z * [new tag] viable/strict/1762823531 -> viable/strict/1762823531 2025-12-04T09:17:12.0732898Z * [new tag] viable/strict/1762849583 -> viable/strict/1762849583 2025-12-04T09:17:12.0734511Z * [new tag] viable/strict/1762851200 -> viable/strict/1762851200 2025-12-04T09:17:12.0736108Z * [new tag] viable/strict/1762854603 -> viable/strict/1762854603 2025-12-04T09:17:12.0737742Z * [new tag] viable/strict/1762858276 -> viable/strict/1762858276 2025-12-04T09:17:12.0740001Z * [new tag] viable/strict/1762860891 -> viable/strict/1762860891 2025-12-04T09:17:12.0742201Z * [new tag] viable/strict/1762866174 -> viable/strict/1762866174 2025-12-04T09:17:12.0743800Z * [new tag] viable/strict/1762867653 -> viable/strict/1762867653 
2025-12-04T09:17:12.0745413Z * [new tag] viable/strict/1762872669 -> viable/strict/1762872669 2025-12-04T09:17:12.0746842Z * [new tag] viable/strict/1762878380 -> viable/strict/1762878380 2025-12-04T09:17:12.0748481Z * [new tag] viable/strict/1762889003 -> viable/strict/1762889003 2025-12-04T09:17:12.0750129Z * [new tag] viable/strict/1762890589 -> viable/strict/1762890589 2025-12-04T09:17:12.0751860Z * [new tag] viable/strict/1762892743 -> viable/strict/1762892743 2025-12-04T09:17:12.0753464Z * [new tag] viable/strict/1762894271 -> viable/strict/1762894271 2025-12-04T09:17:12.0754952Z * [new tag] viable/strict/1762896287 -> viable/strict/1762896287 2025-12-04T09:17:12.0756532Z * [new tag] viable/strict/1762915871 -> viable/strict/1762915871 2025-12-04T09:17:12.0758331Z * [new tag] viable/strict/1762918569 -> viable/strict/1762918569 2025-12-04T09:17:12.0759829Z * [new tag] viable/strict/1762919776 -> viable/strict/1762919776 2025-12-04T09:17:12.0761560Z * [new tag] viable/strict/1762923072 -> viable/strict/1762923072 2025-12-04T09:17:12.0763292Z * [new tag] viable/strict/1762928826 -> viable/strict/1762928826 2025-12-04T09:17:12.0764911Z * [new tag] viable/strict/1762930451 -> viable/strict/1762930451 2025-12-04T09:17:12.0766487Z * [new tag] viable/strict/1762933780 -> viable/strict/1762933780 2025-12-04T09:17:12.0768181Z * [new tag] viable/strict/1762937638 -> viable/strict/1762937638 2025-12-04T09:17:12.0770344Z * [new tag] viable/strict/1762939545 -> viable/strict/1762939545 2025-12-04T09:17:12.0771566Z * [new tag] viable/strict/1762962692 -> viable/strict/1762962692 2025-12-04T09:17:12.0773217Z * [new tag] viable/strict/1762979143 -> viable/strict/1762979143 2025-12-04T09:17:12.0774822Z * [new tag] viable/strict/1762984188 -> viable/strict/1762984188 2025-12-04T09:17:12.0776315Z * [new tag] viable/strict/1762986306 -> viable/strict/1762986306 2025-12-04T09:17:12.0777984Z * [new tag] viable/strict/1762989903 -> viable/strict/1762989903 
2025-12-04T09:17:12.0779635Z * [new tag] viable/strict/1762991377 -> viable/strict/1762991377 2025-12-04T09:17:12.0781241Z * [new tag] viable/strict/1762998921 -> viable/strict/1762998921 2025-12-04T09:17:12.0782965Z * [new tag] viable/strict/1763002287 -> viable/strict/1763002287 2025-12-04T09:17:12.0784639Z * [new tag] viable/strict/1763016840 -> viable/strict/1763016840 2025-12-04T09:17:12.0786251Z * [new tag] viable/strict/1763020180 -> viable/strict/1763020180 2025-12-04T09:17:12.0787942Z * [new tag] viable/strict/1763027421 -> viable/strict/1763027421 2025-12-04T09:17:12.0789607Z * [new tag] viable/strict/1763031120 -> viable/strict/1763031120 2025-12-04T09:17:12.0791297Z * [new tag] viable/strict/1763036861 -> viable/strict/1763036861 2025-12-04T09:17:12.0792914Z * [new tag] viable/strict/1763038993 -> viable/strict/1763038993 2025-12-04T09:17:12.0794847Z * [new tag] viable/strict/1763054703 -> viable/strict/1763054703 2025-12-04T09:17:12.0796083Z * [new tag] viable/strict/1763067061 -> viable/strict/1763067061 2025-12-04T09:17:12.0797769Z * [new tag] viable/strict/1763070847 -> viable/strict/1763070847 2025-12-04T09:17:12.0799347Z * [new tag] viable/strict/1763072706 -> viable/strict/1763072706 2025-12-04T09:17:12.0801274Z * [new tag] viable/strict/1763076302 -> viable/strict/1763076302 2025-12-04T09:17:12.0803048Z * [new tag] viable/strict/1763080816 -> viable/strict/1763080816 2025-12-04T09:17:12.0804737Z * [new tag] viable/strict/1763082732 -> viable/strict/1763082732 2025-12-04T09:17:12.0806278Z * [new tag] viable/strict/1763085329 -> viable/strict/1763085329 2025-12-04T09:17:12.0807892Z * [new tag] viable/strict/1763088623 -> viable/strict/1763088623 2025-12-04T09:17:12.0809707Z * [new tag] viable/strict/1763091402 -> viable/strict/1763091402 2025-12-04T09:17:12.0811317Z * [new tag] viable/strict/1763092602 -> viable/strict/1763092602 2025-12-04T09:17:12.0812868Z * [new tag] viable/strict/1763094355 -> viable/strict/1763094355 
2025-12-04T09:17:12.0814585Z * [new tag] viable/strict/1763099390 -> viable/strict/1763099390 2025-12-04T09:17:12.0816228Z * [new tag] viable/strict/1763101608 -> viable/strict/1763101608 2025-12-04T09:17:12.0817895Z * [new tag] viable/strict/1763105102 -> viable/strict/1763105102 2025-12-04T09:17:12.0819581Z * [new tag] viable/strict/1763112347 -> viable/strict/1763112347 2025-12-04T09:17:12.0821164Z * [new tag] viable/strict/1763119471 -> viable/strict/1763119471 2025-12-04T09:17:12.0822816Z * [new tag] viable/strict/1763126835 -> viable/strict/1763126835 2025-12-04T09:17:12.0824167Z * [new tag] viable/strict/1763149779 -> viable/strict/1763149779 2025-12-04T09:17:12.0825757Z * [new tag] viable/strict/1763164178 -> viable/strict/1763164178 2025-12-04T09:17:12.0827398Z * [new tag] viable/strict/1763167104 -> viable/strict/1763167104 2025-12-04T09:17:12.0828956Z * [new tag] viable/strict/1763169132 -> viable/strict/1763169132 2025-12-04T09:17:12.0830602Z * [new tag] viable/strict/1763171708 -> viable/strict/1763171708 2025-12-04T09:17:12.0832154Z * [new tag] viable/strict/1763174759 -> viable/strict/1763174759 2025-12-04T09:17:12.0833850Z * [new tag] viable/strict/1763180744 -> viable/strict/1763180744 2025-12-04T09:17:12.0835438Z * [new tag] viable/strict/1763182227 -> viable/strict/1763182227 2025-12-04T09:17:12.0837128Z * [new tag] viable/strict/1763184309 -> viable/strict/1763184309 2025-12-04T09:17:12.0839671Z * [new tag] viable/strict/1763187991 -> viable/strict/1763187991 2025-12-04T09:17:12.0841119Z * [new tag] viable/strict/1763191445 -> viable/strict/1763191445 2025-12-04T09:17:12.0842858Z * [new tag] viable/strict/1763195152 -> viable/strict/1763195152 2025-12-04T09:17:12.0844309Z * [new tag] viable/strict/1763205769 -> viable/strict/1763205769 2025-12-04T09:17:12.0846482Z * [new tag] viable/strict/1763246990 -> viable/strict/1763246990 2025-12-04T09:17:12.0848150Z * [new tag] viable/strict/1763261578 -> viable/strict/1763261578 
2025-12-04T09:17:12.0849680Z * [new tag] viable/strict/1763286573 -> viable/strict/1763286573 2025-12-04T09:17:12.0851158Z * [new tag] viable/strict/1763292167 -> viable/strict/1763292167 2025-12-04T09:17:12.0852816Z * [new tag] viable/strict/1763333386 -> viable/strict/1763333386 2025-12-04T09:17:12.0854416Z * [new tag] viable/strict/1763340082 -> viable/strict/1763340082 2025-12-04T09:17:12.0856712Z * [new tag] viable/strict/1763364324 -> viable/strict/1763364324 2025-12-04T09:17:12.0858460Z * [new tag] viable/strict/1763371569 -> viable/strict/1763371569 2025-12-04T09:17:12.0860027Z * [new tag] viable/strict/1763373067 -> viable/strict/1763373067 2025-12-04T09:17:12.0861630Z * [new tag] viable/strict/1763375157 -> viable/strict/1763375157 2025-12-04T09:17:12.0863277Z * [new tag] viable/strict/1763382462 -> viable/strict/1763382462 2025-12-04T09:17:12.0864961Z * [new tag] viable/strict/1763394661 -> viable/strict/1763394661 2025-12-04T09:17:12.0866750Z * [new tag] viable/strict/1763396797 -> viable/strict/1763396797 2025-12-04T09:17:12.0868426Z * [new tag] viable/strict/1763398542 -> viable/strict/1763398542 2025-12-04T09:17:12.0870127Z * [new tag] viable/strict/1763401807 -> viable/strict/1763401807 2025-12-04T09:17:12.0871619Z * [new tag] viable/strict/1763414698 -> viable/strict/1763414698 2025-12-04T09:17:12.0873246Z * [new tag] viable/strict/1763419807 -> viable/strict/1763419807 2025-12-04T09:17:12.0874833Z * [new tag] viable/strict/1763426369 -> viable/strict/1763426369 2025-12-04T09:17:12.0876483Z * [new tag] viable/strict/1763428331 -> viable/strict/1763428331 2025-12-04T09:17:12.0878409Z * [new tag] viable/strict/1763430922 -> viable/strict/1763430922 2025-12-04T09:17:12.0879702Z * [new tag] viable/strict/1763434184 -> viable/strict/1763434184 2025-12-04T09:17:12.0881356Z * [new tag] viable/strict/1763439973 -> viable/strict/1763439973 2025-12-04T09:17:12.0883157Z * [new tag] viable/strict/1763444995 -> viable/strict/1763444995 
2025-12-04T09:17:12.0884614Z * [new tag] viable/strict/1763447206 -> viable/strict/1763447206 2025-12-04T09:17:12.0886260Z * [new tag] viable/strict/1763448826 -> viable/strict/1763448826 2025-12-04T09:17:12.0887875Z * [new tag] viable/strict/1763450717 -> viable/strict/1763450717 2025-12-04T09:17:12.0889564Z * [new tag] viable/strict/1763452183 -> viable/strict/1763452183 2025-12-04T09:17:12.0891263Z * [new tag] viable/strict/1763457945 -> viable/strict/1763457945 2025-12-04T09:17:12.0892830Z * [new tag] viable/strict/1763459439 -> viable/strict/1763459439 2025-12-04T09:17:12.0894322Z * [new tag] viable/strict/1763461556 -> viable/strict/1763461556 2025-12-04T09:17:12.0895960Z * [new tag] viable/strict/1763463103 -> viable/strict/1763463103 2025-12-04T09:17:12.0897629Z * [new tag] viable/strict/1763465100 -> viable/strict/1763465100 2025-12-04T09:17:12.0899110Z * [new tag] viable/strict/1763468866 -> viable/strict/1763468866 2025-12-04T09:17:12.0900860Z * [new tag] viable/strict/1763493823 -> viable/strict/1763493823 2025-12-04T09:17:12.0902299Z * [new tag] viable/strict/1763496249 -> viable/strict/1763496249 2025-12-04T09:17:12.0903874Z * [new tag] viable/strict/1763502620 -> viable/strict/1763502620 2025-12-04T09:17:12.0905517Z * [new tag] viable/strict/1763504715 -> viable/strict/1763504715 2025-12-04T09:17:12.0907129Z * [new tag] viable/strict/1763506208 -> viable/strict/1763506208 2025-12-04T09:17:12.0908900Z * [new tag] viable/strict/1763520590 -> viable/strict/1763520590 2025-12-04T09:17:12.0910560Z * [new tag] viable/strict/1763523357 -> viable/strict/1763523357 2025-12-04T09:17:12.0912265Z * [new tag] viable/strict/1763529922 -> viable/strict/1763529922 2025-12-04T09:17:12.0913949Z * [new tag] viable/strict/1763531408 -> viable/strict/1763531408 2025-12-04T09:17:12.0915559Z * [new tag] viable/strict/1763533622 -> viable/strict/1763533622 2025-12-04T09:17:12.0917136Z * [new tag] viable/strict/1763538576 -> viable/strict/1763538576 
2025-12-04T09:17:12.0918810Z * [new tag] viable/strict/1763545823 -> viable/strict/1763545823 2025-12-04T09:17:12.0920434Z * [new tag] viable/strict/1763547951 -> viable/strict/1763547951 2025-12-04T09:17:12.0922170Z * [new tag] viable/strict/1763551477 -> viable/strict/1763551477 2025-12-04T09:17:12.0923714Z * [new tag] viable/strict/1763552982 -> viable/strict/1763552982 2025-12-04T09:17:12.0925386Z * [new tag] viable/strict/1763594698 -> viable/strict/1763594698 2025-12-04T09:17:12.0926974Z * [new tag] viable/strict/1763596178 -> viable/strict/1763596178 2025-12-04T09:17:12.0928657Z * [new tag] viable/strict/1763599155 -> viable/strict/1763599155 2025-12-04T09:17:12.0930177Z * [new tag] viable/strict/1763603717 -> viable/strict/1763603717 2025-12-04T09:17:12.0931985Z * [new tag] viable/strict/1763606923 -> viable/strict/1763606923 2025-12-04T09:17:12.0933582Z * [new tag] viable/strict/1763609715 -> viable/strict/1763609715 2025-12-04T09:17:12.0935114Z * [new tag] viable/strict/1763612757 -> viable/strict/1763612757 2025-12-04T09:17:12.0936692Z * [new tag] viable/strict/1763616325 -> viable/strict/1763616325 2025-12-04T09:17:12.0938283Z * [new tag] viable/strict/1763623509 -> viable/strict/1763623509 2025-12-04T09:17:12.0940057Z * [new tag] viable/strict/1763624984 -> viable/strict/1763624984 2025-12-04T09:17:12.0941819Z * [new tag] viable/strict/1763628796 -> viable/strict/1763628796 2025-12-04T09:17:12.0943217Z * [new tag] viable/strict/1763634343 -> viable/strict/1763634343 2025-12-04T09:17:12.0944801Z * [new tag] viable/strict/1763635867 -> viable/strict/1763635867 2025-12-04T09:17:12.0946563Z * [new tag] viable/strict/1763639382 -> viable/strict/1763639382 2025-12-04T09:17:12.0948125Z * [new tag] viable/strict/1763646626 -> viable/strict/1763646626 2025-12-04T09:17:12.0949934Z * [new tag] viable/strict/1763655997 -> viable/strict/1763655997 2025-12-04T09:17:12.0951978Z * [new tag] viable/strict/1763659444 -> viable/strict/1763659444 
2025-12-04T09:17:12.0953570Z * [new tag] viable/strict/1763660992 -> viable/strict/1763660992 2025-12-04T09:17:12.0955125Z * [new tag] viable/strict/1763663201 -> viable/strict/1763663201 2025-12-04T09:17:12.0956817Z * [new tag] viable/strict/1763670362 -> viable/strict/1763670362 2025-12-04T09:17:12.0958265Z * [new tag] viable/strict/1763675378 -> viable/strict/1763675378 2025-12-04T09:17:12.0960007Z * [new tag] viable/strict/1763693343 -> viable/strict/1763693343 2025-12-04T09:17:12.0961605Z * [new tag] viable/strict/1763696088 -> viable/strict/1763696088 2025-12-04T09:17:12.0963451Z * [new tag] viable/strict/1763697343 -> viable/strict/1763697343 2025-12-04T09:17:12.0964998Z * [new tag] viable/strict/1763699165 -> viable/strict/1763699165 2025-12-04T09:17:12.0966608Z * [new tag] viable/strict/1763700660 -> viable/strict/1763700660 2025-12-04T09:17:12.0968237Z * [new tag] viable/strict/1763704209 -> viable/strict/1763704209 2025-12-04T09:17:12.0969910Z * [new tag] viable/strict/1763706411 -> viable/strict/1763706411 2025-12-04T09:17:12.0971466Z * [new tag] viable/strict/1763708082 -> viable/strict/1763708082 2025-12-04T09:17:12.0972968Z * [new tag] viable/strict/1763711381 -> viable/strict/1763711381 2025-12-04T09:17:12.0974484Z * [new tag] viable/strict/1763713593 -> viable/strict/1763713593 2025-12-04T09:17:12.0976158Z * [new tag] viable/strict/1763715201 -> viable/strict/1763715201 2025-12-04T09:17:12.0977734Z * [new tag] viable/strict/1763733017 -> viable/strict/1763733017 2025-12-04T09:17:12.0979328Z * [new tag] viable/strict/1763735108 -> viable/strict/1763735108 2025-12-04T09:17:12.0980929Z * [new tag] viable/strict/1763749579 -> viable/strict/1763749579 2025-12-04T09:17:12.0982521Z * [new tag] viable/strict/1763751113 -> viable/strict/1763751113 2025-12-04T09:17:12.0984177Z * [new tag] viable/strict/1763753035 -> viable/strict/1763753035 2025-12-04T09:17:12.0985831Z * [new tag] viable/strict/1763754578 -> viable/strict/1763754578 
2025-12-04T09:17:12.0987437Z * [new tag] viable/strict/1763756748 -> viable/strict/1763756748 2025-12-04T09:17:12.0989293Z * [new tag] viable/strict/1763758205 -> viable/strict/1763758205 2025-12-04T09:17:12.0990477Z * [new tag] viable/strict/1763764050 -> viable/strict/1763764050 2025-12-04T09:17:12.0992095Z * [new tag] viable/strict/1763771887 -> viable/strict/1763771887 2025-12-04T09:17:12.0993813Z * [new tag] viable/strict/1763773920 -> viable/strict/1763773920 2025-12-04T09:17:12.0995379Z * [new tag] viable/strict/1763776501 -> viable/strict/1763776501 2025-12-04T09:17:12.0996975Z * [new tag] viable/strict/1763779437 -> viable/strict/1763779437 2025-12-04T09:17:12.0998871Z * [new tag] viable/strict/1763781038 -> viable/strict/1763781038 2025-12-04T09:17:12.1000894Z * [new tag] viable/strict/1763782245 -> viable/strict/1763782245 2025-12-04T09:17:12.1002365Z * [new tag] viable/strict/1763785568 -> viable/strict/1763785568 2025-12-04T09:17:12.1003956Z * [new tag] viable/strict/1763787006 -> viable/strict/1763787006 2025-12-04T09:17:12.1005673Z * [new tag] viable/strict/1763789103 -> viable/strict/1763789103 2025-12-04T09:17:12.1007233Z * [new tag] viable/strict/1763790578 -> viable/strict/1763790578 2025-12-04T09:17:12.1008888Z * [new tag] viable/strict/1763796275 -> viable/strict/1763796275 2025-12-04T09:17:12.1010666Z * [new tag] viable/strict/1763801465 -> viable/strict/1763801465 2025-12-04T09:17:12.1012277Z * [new tag] viable/strict/1763803522 -> viable/strict/1763803522 2025-12-04T09:17:12.1014037Z * [new tag] viable/strict/1763808581 -> viable/strict/1763808581 2025-12-04T09:17:12.1015492Z * [new tag] viable/strict/1763840977 -> viable/strict/1763840977 2025-12-04T09:17:12.1017078Z * [new tag] viable/strict/1763846659 -> viable/strict/1763846659 2025-12-04T09:17:12.1018691Z * [new tag] viable/strict/1763872065 -> viable/strict/1763872065 2025-12-04T09:17:12.1020367Z * [new tag] viable/strict/1763873648 -> viable/strict/1763873648 
2025-12-04T09:17:12.1021960Z * [new tag] viable/strict/1763875506 -> viable/strict/1763875506 2025-12-04T09:17:12.1023436Z * [new tag] viable/strict/1763889904 -> viable/strict/1763889904 2025-12-04T09:17:12.1025010Z * [new tag] viable/strict/1763930999 -> viable/strict/1763930999 2025-12-04T09:17:12.1026584Z * [new tag] viable/strict/1763944964 -> viable/strict/1763944964 2025-12-04T09:17:12.1028052Z * [new tag] viable/strict/1763958474 -> viable/strict/1763958474 2025-12-04T09:17:12.1029693Z * [new tag] viable/strict/1763967263 -> viable/strict/1763967263 2025-12-04T09:17:12.1031300Z * [new tag] viable/strict/1763972803 -> viable/strict/1763972803 2025-12-04T09:17:12.1032850Z * [new tag] viable/strict/1763976376 -> viable/strict/1763976376 2025-12-04T09:17:12.1034472Z * [new tag] viable/strict/1763989404 -> viable/strict/1763989404 2025-12-04T09:17:12.1036349Z * [new tag] viable/strict/1763990887 -> viable/strict/1763990887 2025-12-04T09:17:12.1037956Z * [new tag] viable/strict/1764019919 -> viable/strict/1764019919 2025-12-04T09:17:12.1039702Z * [new tag] viable/strict/1764023134 -> viable/strict/1764023134 2025-12-04T09:17:12.1041218Z * [new tag] viable/strict/1764024593 -> viable/strict/1764024593 2025-12-04T09:17:12.1042770Z * [new tag] viable/strict/1764026706 -> viable/strict/1764026706 2025-12-04T09:17:12.1044616Z * [new tag] viable/strict/1764031139 -> viable/strict/1764031139 2025-12-04T09:17:12.1046284Z * [new tag] viable/strict/1764033131 -> viable/strict/1764033131 2025-12-04T09:17:12.1047735Z * [new tag] viable/strict/1764035725 -> viable/strict/1764035725 2025-12-04T09:17:12.1049207Z * [new tag] viable/strict/1764624265 -> viable/strict/1764624265 2025-12-04T09:17:12.1050653Z * [new tag] viable/strict/1764631514 -> viable/strict/1764631514 2025-12-04T09:17:12.1052056Z * [new tag] viable/strict/1764632987 -> viable/strict/1764632987 2025-12-04T09:17:12.1053531Z * [new tag] viable/strict/1764636063 -> viable/strict/1764636063 
2025-12-04T09:17:12.1055387Z * [new tag] viable/strict/1764643975 -> viable/strict/1764643975 2025-12-04T09:17:12.1056839Z * [new tag] viable/strict/1764646859 -> viable/strict/1764646859 2025-12-04T09:17:12.1058391Z * [new tag] viable/strict/1764653120 -> viable/strict/1764653120 2025-12-04T09:17:12.1059770Z * [new tag] viable/strict/1764654632 -> viable/strict/1764654632 2025-12-04T09:17:12.1061198Z * [new tag] viable/strict/1764656821 -> viable/strict/1764656821 2025-12-04T09:17:12.1062645Z * [new tag] viable/strict/1764658557 -> viable/strict/1764658557 2025-12-04T09:17:12.1064028Z * [new tag] viable/strict/1764660333 -> viable/strict/1764660333 2025-12-04T09:17:12.1065827Z * [new tag] viable/strict/1764661812 -> viable/strict/1764661812 2025-12-04T09:17:12.1067015Z * [new tag] viable/strict/1764664023 -> viable/strict/1764664023 2025-12-04T09:17:12.1068484Z * [new tag] viable/strict/1764669150 -> viable/strict/1764669150 2025-12-04T09:17:12.1069906Z * [new tag] viable/strict/1764680709 -> viable/strict/1764680709 2025-12-04T09:17:12.1071377Z * [new tag] viable/strict/1764687619 -> viable/strict/1764687619 2025-12-04T09:17:12.1072803Z * [new tag] viable/strict/1764696355 -> viable/strict/1764696355 2025-12-04T09:17:12.1074262Z * [new tag] viable/strict/1764701767 -> viable/strict/1764701767 2025-12-04T09:17:12.1075700Z * [new tag] viable/strict/1764710768 -> viable/strict/1764710768 2025-12-04T09:17:12.1077128Z * [new tag] viable/strict/1764716202 -> viable/strict/1764716202 2025-12-04T09:17:12.1078662Z * [new tag] viable/strict/1764793566 -> viable/strict/1764793566 2025-12-04T09:17:12.1080115Z * [new tag] viable/strict/1764797093 -> viable/strict/1764797093 2025-12-04T09:17:12.1081555Z * [new tag] viable/strict/1764800729 -> viable/strict/1764800729 2025-12-04T09:17:12.1083138Z * [new tag] whc_flight_1 -> whc_flight_1 2025-12-04T09:17:12.1084802Z * [new tag] whc_flight_2 -> whc_flight_2 2025-12-04T09:17:12.1086543Z * [new tag] whc_flight_4 -> whc_flight_4 
2025-12-04T09:17:12.2268878Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T09:17:12.2300738Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:17:12.2306511Z ##[endgroup] 2025-12-04T09:17:12.2306790Z ##[group]Determining the checkout info 2025-12-04T09:17:12.2308537Z ##[endgroup] 2025-12-04T09:17:12.2314060Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T09:17:12.2358078Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T09:17:12.2392778Z ##[group]Checking out the ref 2025-12-04T09:17:12.2396781Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:17:13.2675991Z Updating files: 65% (13210/20121) 2025-12-04T09:17:13.2758223Z Updating files: 66% (13280/20121) 2025-12-04T09:17:13.2840437Z Updating files: 67% (13482/20121) 2025-12-04T09:17:13.2921948Z Updating files: 68% (13683/20121) 2025-12-04T09:17:13.3134854Z Updating files: 69% (13884/20121) 2025-12-04T09:17:13.3467983Z Updating files: 70% (14085/20121) 2025-12-04T09:17:13.3539145Z Updating files: 71% (14286/20121) 2025-12-04T09:17:13.3632270Z Updating files: 72% (14488/20121) 2025-12-04T09:17:13.3861791Z Updating files: 73% (14689/20121) 2025-12-04T09:17:13.4137739Z Updating files: 74% (14890/20121) 2025-12-04T09:17:13.4703895Z Updating files: 75% (15091/20121) 2025-12-04T09:17:13.4884512Z Updating files: 76% (15292/20121) 2025-12-04T09:17:13.5051062Z Updating files: 77% (15494/20121) 2025-12-04T09:17:13.5291022Z Updating files: 78% (15695/20121) 2025-12-04T09:17:13.5591150Z Updating files: 79% (15896/20121) 2025-12-04T09:17:13.5951616Z Updating files: 80% (16097/20121) 2025-12-04T09:17:13.6275024Z Updating files: 81% (16299/20121) 2025-12-04T09:17:13.6533822Z Updating files: 82% (16500/20121) 2025-12-04T09:17:13.6725512Z Updating files: 83% (16701/20121) 2025-12-04T09:17:13.6901103Z Updating files: 84% (16902/20121) 2025-12-04T09:17:13.7099979Z 
Updating files: 85% (17103/20121) 2025-12-04T09:17:13.7292550Z Updating files: 86% (17305/20121) 2025-12-04T09:17:13.7469725Z Updating files: 87% (17506/20121) 2025-12-04T09:17:13.7620624Z Updating files: 88% (17707/20121) 2025-12-04T09:17:13.7798282Z Updating files: 89% (17908/20121) 2025-12-04T09:17:13.8009487Z Updating files: 90% (18109/20121) 2025-12-04T09:17:13.8162279Z Updating files: 91% (18311/20121) 2025-12-04T09:17:13.8352961Z Updating files: 92% (18512/20121) 2025-12-04T09:17:13.8574022Z Updating files: 93% (18713/20121) 2025-12-04T09:17:13.8818864Z Updating files: 94% (18914/20121) 2025-12-04T09:17:13.9034416Z Updating files: 95% (19115/20121) 2025-12-04T09:17:13.9229748Z Updating files: 96% (19317/20121) 2025-12-04T09:17:13.9431018Z Updating files: 97% (19518/20121) 2025-12-04T09:17:13.9759111Z Updating files: 98% (19719/20121) 2025-12-04T09:17:13.9971670Z Updating files: 99% (19920/20121) 2025-12-04T09:17:13.9972002Z Updating files: 100% (20121/20121) 2025-12-04T09:17:13.9972294Z Updating files: 100% (20121/20121), done. 2025-12-04T09:17:14.0263949Z Note: switching to 'ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32'. 2025-12-04T09:17:14.0264315Z 2025-12-04T09:17:14.0264521Z You are in 'detached HEAD' state. You can look around, make experimental 2025-12-04T09:17:14.0265033Z changes and commit them, and you can discard any commits you make in this 2025-12-04T09:17:14.0265546Z state without impacting any branches by switching back to a branch. 2025-12-04T09:17:14.0265843Z 2025-12-04T09:17:14.0266039Z If you want to create a new branch to retain commits you create, you may 2025-12-04T09:17:14.0266511Z do so (now or later) by using -c with the switch command. 
Example: 2025-12-04T09:17:14.0266779Z 2025-12-04T09:17:14.0266901Z git switch -c <new-branch-name> 2025-12-04T09:17:14.0267092Z 2025-12-04T09:17:14.0267214Z Or undo this operation with: 2025-12-04T09:17:14.0267387Z 2025-12-04T09:17:14.0267474Z git switch - 2025-12-04T09:17:14.0267607Z 2025-12-04T09:17:14.0267835Z Turn off this advice by setting config variable advice.detachedHead to false 2025-12-04T09:17:14.0268163Z 2025-12-04T09:17:14.0269874Z HEAD is now at ffd9b0fb435 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T09:17:14.0452697Z ##[endgroup] 2025-12-04T09:17:14.0453352Z ##[group]Setting up auth for fetching submodules 2025-12-04T09:17:14.0460397Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T09:17:14.0522320Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T09:17:14.0558598Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T09:17:14.0594162Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T09:17:14.0628381Z ##[endgroup] 2025-12-04T09:17:14.0629104Z ##[group]Fetching submodules 2025-12-04T09:17:14.0632227Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T09:17:14.1050160Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T09:17:14.1456347Z Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni' 2025-12-04T09:17:14.1458495Z Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16' 2025-12-04T09:17:14.1462156Z Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv' 2025-12-04T09:17:14.1465930Z Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) 
registered for path 'third_party/NNPACK' 2025-12-04T09:17:14.1469823Z Submodule 'third_party/NVTX' (https://github.com/NVIDIA/NVTX.git) registered for path 'third_party/NVTX' 2025-12-04T09:17:14.1474389Z Submodule 'third_party/VulkanMemoryAllocator' (https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git) registered for path 'third_party/VulkanMemoryAllocator' 2025-12-04T09:17:14.1477895Z Submodule 'third_party/XNNPACK' (https://github.com/google/XNNPACK.git) registered for path 'third_party/XNNPACK' 2025-12-04T09:17:14.1482044Z Submodule 'third_party/aiter' (https://github.com/ROCm/aiter.git) registered for path 'third_party/aiter' 2025-12-04T09:17:14.1486222Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark' 2025-12-04T09:17:14.1490897Z Submodule 'third_party/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/composable_kernel' 2025-12-04T09:17:14.1495183Z Submodule 'third_party/cpp-httplib' (https://github.com/yhirose/cpp-httplib.git) registered for path 'third_party/cpp-httplib' 2025-12-04T09:17:14.1499624Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo.git) registered for path 'third_party/cpuinfo' 2025-12-04T09:17:14.1504843Z Submodule 'third_party/cudnn_frontend' (https://github.com/NVIDIA/cudnn-frontend.git) registered for path 'third_party/cudnn_frontend' 2025-12-04T09:17:14.1509379Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/cutlass' 2025-12-04T09:17:14.1513957Z Submodule 'third_party/fbgemm' (https://github.com/pytorch/fbgemm) registered for path 'third_party/fbgemm' 2025-12-04T09:17:14.1519922Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'third_party/flash-attention' 2025-12-04T09:17:14.1528280Z Submodule 'third_party/flatbuffers' (https://github.com/google/flatbuffers.git) 
registered for path 'third_party/flatbuffers' 2025-12-04T09:17:14.1532909Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/fmt' 2025-12-04T09:17:14.1538087Z Submodule 'third_party/gemmlowp/gemmlowp' (https://github.com/google/gemmlowp.git) registered for path 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:17:14.1542944Z Submodule 'third_party/gloo' (https://github.com/pytorch/gloo) registered for path 'third_party/gloo' 2025-12-04T09:17:14.1548244Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest' 2025-12-04T09:17:14.1556308Z Submodule 'third_party/ideep' (https://github.com/intel/ideep) registered for path 'third_party/ideep' 2025-12-04T09:17:14.1559666Z Submodule 'third_party/ittapi' (https://github.com/intel/ittapi.git) registered for path 'third_party/ittapi' 2025-12-04T09:17:14.1565124Z Submodule 'third_party/kineto' (https://github.com/pytorch/kineto) registered for path 'third_party/kineto' 2025-12-04T09:17:14.1570761Z Submodule 'third_party/kleidiai' (https://github.com/ARM-software/kleidiai.git) registered for path 'third_party/kleidiai' 2025-12-04T09:17:14.1576213Z Submodule 'third_party/mimalloc' (https://github.com/microsoft/mimalloc.git) registered for path 'third_party/mimalloc' 2025-12-04T09:17:14.1581997Z Submodule 'third_party/nlohmann' (https://github.com/nlohmann/json.git) registered for path 'third_party/nlohmann' 2025-12-04T09:17:14.1587622Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx' 2025-12-04T09:17:14.1593590Z Submodule 'third_party/opentelemetry-cpp' (https://github.com/open-telemetry/opentelemetry-cpp.git) registered for path 'third_party/opentelemetry-cpp' 2025-12-04T09:17:14.1599196Z Submodule 'third_party/pocketfft' (https://github.com/mreineck/pocketfft) registered for path 'third_party/pocketfft' 2025-12-04T09:17:14.1606876Z Submodule 'third_party/protobuf' 
(https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf' 2025-12-04T09:17:14.1612599Z Submodule 'third_party/NNPACK_deps/psimd' (https://github.com/Maratyszcza/psimd.git) registered for path 'third_party/psimd' 2025-12-04T09:17:14.1618934Z Submodule 'third_party/NNPACK_deps/pthreadpool' (https://github.com/Maratyszcza/pthreadpool.git) registered for path 'third_party/pthreadpool' 2025-12-04T09:17:14.1628329Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11' 2025-12-04T09:17:14.1634590Z Submodule 'third_party/python-peachpy' (https://github.com/malfet/PeachPy.git) registered for path 'third_party/python-peachpy' 2025-12-04T09:17:14.1641020Z Submodule 'third_party/sleef' (https://github.com/shibatch/sleef) registered for path 'third_party/sleef' 2025-12-04T09:17:14.1647444Z Submodule 'third_party/tensorpipe' (https://github.com/pytorch/tensorpipe.git) registered for path 'third_party/tensorpipe' 2025-12-04T09:17:14.1687287Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/android/libs/fbjni'... 2025-12-04T09:17:14.4054486Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FXdiv'... 2025-12-04T09:17:14.4055485Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FP16'... 2025-12-04T09:17:14.4093500Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fmt'... 2025-12-04T09:17:17.2346487Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NNPACK'... 2025-12-04T09:17:17.2347968Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/benchmark'... 2025-12-04T09:17:17.2349810Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NVTX'... 2025-12-04T09:17:17.2350901Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gloo'... 
2025-12-04T09:17:17.2352088Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gemmlowp/gemmlowp'... 2025-12-04T09:17:17.2353430Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention'... 2025-12-04T09:17:17.2354915Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpuinfo'... 2025-12-04T09:17:17.2356277Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpp-httplib'... 2025-12-04T09:17:17.2357494Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep'... 2025-12-04T09:17:17.2358505Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ittapi'... 2025-12-04T09:17:17.2361174Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kleidiai'... 2025-12-04T09:17:17.2362393Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pocketfft'... 2025-12-04T09:17:17.2363688Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cudnn_frontend'... 2025-12-04T09:17:17.2364917Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/psimd'... 2025-12-04T09:17:17.2365944Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/googletest'... 2025-12-04T09:17:17.2517492Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'... 2025-12-04T09:17:17.5526213Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/mimalloc'... 2025-12-04T09:17:17.5527398Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pthreadpool'... 2025-12-04T09:17:17.5656227Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp'... 2025-12-04T09:17:31.9803974Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/VulkanMemoryAllocator'... 
2025-12-04T09:17:31.9804877Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'... 2025-12-04T09:17:31.9806143Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'... 2025-12-04T09:17:31.9807352Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'... 2025-12-04T09:17:31.9808493Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'... 2025-12-04T09:17:31.9809932Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'... 2025-12-04T09:17:31.9810780Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'... 2025-12-04T09:17:31.9811847Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'... 2025-12-04T09:17:31.9812736Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'... 2025-12-04T09:17:31.9813529Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/composable_kernel'... 2025-12-04T09:17:31.9814678Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'... 2025-12-04T09:17:32.0805854Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'... 2025-12-04T09:17:38.6230653Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter'... 2025-12-04T09:17:38.6231357Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'... 
2025-12-04T09:17:38.6446676Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T09:17:38.6620874Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T09:17:38.6762434Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T09:17:38.7113003Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T09:17:38.8233559Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T09:17:38.8893346Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T09:17:39.8796287Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T09:17:40.1062232Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T09:17:40.1091790Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:17:40.1127565Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter/3rdparty/composable_kernel'... 
2025-12-04T09:17:45.8579828Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T09:17:45.8912492Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T09:17:46.3703801Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T09:17:46.4312482Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T09:17:46.6082940Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T09:17:46.6689943Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T09:17:47.5158768Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T09:17:47.7194715Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T09:17:47.7224688Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'third_party/fbgemm/external/asmjit' 2025-12-04T09:17:47.7227859Z Submodule 'external/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:17:47.7231502Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:17:47.7235435Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'third_party/fbgemm/external/cutlass' 2025-12-04T09:17:47.7239772Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'third_party/fbgemm/external/googletest' 2025-12-04T09:17:47.7243884Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 
'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:17:47.7247917Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/fbgemm/external/json' 2025-12-04T09:17:47.7283398Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/asmjit'... 2025-12-04T09:17:48.9788810Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/hipify_torch'... 2025-12-04T09:17:48.9789679Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cpuinfo'... 2025-12-04T09:17:48.9790474Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/googletest'... 2025-12-04T09:17:49.0790183Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/composable_kernel'... 2025-12-04T09:17:52.6858438Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cutlass'... 2025-12-04T09:17:52.7858498Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/json'... 
2025-12-04T09:17:55.5816302Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T09:17:56.0575843Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T09:17:56.1799608Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T09:17:57.0030431Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T09:17:57.0600461Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:17:57.0760413Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T09:17:57.2157971Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T09:17:57.3112493Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T09:17:57.3138597Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:17:57.3141742Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:17:57.3179326Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/composable_kernel'... 2025-12-04T09:18:02.2381720Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/cutlass'... 
2025-12-04T09:18:02.5754616Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T09:18:03.3070352Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T09:18:03.4954868Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T09:18:03.5331980Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T09:18:03.5820611Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T09:18:03.6179164Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T09:18:03.6741581Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:18:03.6926909Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T09:18:03.6949898Z Submodule 'mkl-dnn' (https://github.com/intel/mkl-dnn.git) registered for path 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:03.6982174Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep/mkl-dnn'... 
2025-12-04T09:18:21.2649206Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T09:18:21.2927917Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T09:18:21.3951389Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T09:18:21.3979720Z Submodule 'libkineto/third_party/dynolog' (https://github.com/facebookincubator/dynolog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:21.3982652Z Submodule 'libkineto/third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:21.3986792Z Submodule 'libkineto/third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:21.4021775Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog'... 2025-12-04T09:18:22.1423927Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/fmt'... 2025-12-04T09:18:22.7475049Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/googletest'... 
2025-12-04T09:18:22.8572676Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T09:18:22.8595753Z Submodule 'third_party/DCGM' (https://github.com/NVIDIA/DCGM.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:22.8599706Z Submodule 'third_party/cpr' (https://github.com/libcpr/cpr.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:22.8604332Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:22.8608225Z Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:22.8612613Z Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:22.8616925Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:22.8622029Z Submodule 'third_party/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:22.8626496Z Submodule 'third_party/pfs' (https://github.com/dtrugman/pfs.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:22.8631155Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:22.8666237Z Cloning into 
'/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'... 2025-12-04T09:18:24.8610277Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'... 2025-12-04T09:18:24.8611426Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp'... 2025-12-04T09:18:24.8612798Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'... 2025-12-04T09:18:24.8613842Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'... 2025-12-04T09:18:24.8615027Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/glog'... 2025-12-04T09:18:24.8616100Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'... 2025-12-04T09:18:24.8617212Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'... 2025-12-04T09:18:24.9610292Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/json'... 
2025-12-04T09:18:32.0551750Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T09:18:32.0805948Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T09:18:32.1274932Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T09:18:32.1467839Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T09:18:32.1491235Z Submodule 'doc' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:32.1528020Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'... 
2025-12-04T09:18:32.4409350Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T09:18:32.4662765Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T09:18:32.5227566Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:18:32.6518699Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T09:18:32.6748425Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T09:18:32.6994768Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T09:18:32.7018693Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:32.7021678Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:32.7057123Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T09:18:34.8790156Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest'... 
2025-12-04T09:18:35.1735782Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T09:18:35.2318067Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T09:18:35.2730834Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T09:18:35.3333315Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:18:35.4069234Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T09:18:35.4615879Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T09:18:35.6010031Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T09:18:36.2312391Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T09:18:36.2353231Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:36.2386657Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/pybind11'... 
2025-12-04T09:18:37.2010658Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T09:18:37.3008579Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T09:18:37.3034919Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark) registered for path 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:37.3039373Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:37.3042530Z Submodule 'third_party/ms-gsl' (https://github.com/microsoft/GSL) registered for path 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:37.3046624Z Submodule 'third_party/nlohmann-json' (https://github.com/nlohmann/json) registered for path 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:37.3050915Z Submodule 'third_party/opentelemetry-proto' (https://github.com/open-telemetry/opentelemetry-proto) registered for path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:37.3055045Z Submodule 'third_party/opentracing-cpp' (https://github.com/opentracing/opentracing-cpp.git) registered for path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:37.3059366Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:37.3063609Z Submodule 'tools/vcpkg' (https://github.com/Microsoft/vcpkg) registered for path 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:37.3099241Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/benchmark'... 
2025-12-04T09:18:37.7646898Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentracing-cpp'... 2025-12-04T09:18:37.7649857Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentelemetry-proto'... 2025-12-04T09:18:37.7651086Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp'... 2025-12-04T09:18:37.7652106Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/ms-gsl'... 2025-12-04T09:18:37.8648476Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/googletest'... 2025-12-04T09:18:38.4763708Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/nlohmann-json'... 2025-12-04T09:18:45.7668680Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/tools/vcpkg'... 
2025-12-04T09:18:46.5078684Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T09:18:46.5593183Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T09:18:46.5816969Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T09:18:46.7178413Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T09:18:46.7366526Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T09:18:46.7580183Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T09:18:46.7815451Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T09:18:46.7838273Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:46.7842094Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:46.7876680Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T09:18:49.0084160Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'... 
2025-12-04T09:18:49.3000595Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T09:18:49.3575820Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T09:18:50.0871957Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T09:18:50.1037174Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T09:18:50.4430087Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T09:18:50.4458153Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:50.4462318Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:50.4495900Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/benchmark'... 2025-12-04T09:18:51.0065124Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/googletest'... 
2025-12-04T09:18:51.4729485Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T09:18:51.5596879Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T09:18:51.5742352Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T09:18:51.5916752Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T09:18:51.6479603Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T09:18:51.6848916Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T09:18:51.7396202Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T09:18:51.7784865Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T09:18:51.7808539Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:51.7812252Z Submodule 'third_party/libnop' (https://github.com/google/libnop.git) registered for path 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:51.7820337Z Submodule 'third_party/libuv' (https://github.com/libuv/libuv.git) registered for path 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:51.7821341Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:51.7858158Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/googletest'... 2025-12-04T09:18:52.9595464Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libnop'... 
2025-12-04T09:18:52.9596351Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11'... 2025-12-04T09:18:52.9983335Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libuv'... 2025-12-04T09:18:53.0671346Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T09:18:53.0891791Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T09:18:53.1798134Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T09:18:53.2174674Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T09:18:53.2199595Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:53.2237963Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11/tools/clang'... 
2025-12-04T09:18:53.4034067Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T09:18:53.4088157Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T09:18:53.4487062Z Entering 'android/libs/fbjni' 2025-12-04T09:18:53.4548026Z Entering 'third_party/FP16' 2025-12-04T09:18:53.4611532Z Entering 'third_party/FXdiv' 2025-12-04T09:18:53.4670513Z Entering 'third_party/NNPACK' 2025-12-04T09:18:53.4732993Z Entering 'third_party/NVTX' 2025-12-04T09:18:53.4794465Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:53.4855399Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:53.4930758Z Entering 'third_party/aiter' 2025-12-04T09:18:53.4992709Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:53.5062927Z Entering 'third_party/benchmark' 2025-12-04T09:18:53.5123488Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:53.5191847Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:53.5253711Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:53.5314773Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:18:53.5374693Z Entering 'third_party/cutlass' 2025-12-04T09:18:53.5445028Z Entering 'third_party/fbgemm' 2025-12-04T09:18:53.5506417Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:53.5564639Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:53.5633309Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:53.5690960Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:53.5761638Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:53.5818783Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:53.5875972Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:53.5937411Z Entering 'third_party/flash-attention' 2025-12-04T09:18:53.5998433Z Entering 'third_party/flash-attention/csrc/composable_kernel' 
2025-12-04T09:18:53.6065186Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:53.6135419Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:53.6198770Z Entering 'third_party/fmt' 2025-12-04T09:18:53.6261868Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:53.6322651Z Entering 'third_party/gloo' 2025-12-04T09:18:53.6382443Z Entering 'third_party/googletest' 2025-12-04T09:18:53.6442689Z Entering 'third_party/ideep' 2025-12-04T09:18:53.6500914Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:53.6569928Z Entering 'third_party/ittapi' 2025-12-04T09:18:53.6630028Z Entering 'third_party/kineto' 2025-12-04T09:18:53.6689775Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:53.6745343Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:53.6805679Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:53.6862905Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:53.6920748Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:53.6976786Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:53.7040975Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:53.7102498Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:53.7161472Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:53.7224609Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:53.7286265Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:53.7342984Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:53.7403819Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:53.7471490Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:53.7529355Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:53.7590706Z Entering 'third_party/kleidiai' 2025-12-04T09:18:53.7653816Z Entering 'third_party/mimalloc' 2025-12-04T09:18:53.7713679Z Entering 'third_party/nlohmann' 2025-12-04T09:18:53.7775148Z Entering 'third_party/onnx' 2025-12-04T09:18:53.7855806Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:53.7924253Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:53.7983331Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:53.8042846Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:53.8102581Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:53.8159616Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:53.8222677Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:53.8280737Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:53.8338746Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:53.8393475Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:53.8453709Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:53.8515271Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:53.8595881Z Entering 'third_party/pocketfft' 2025-12-04T09:18:53.8655084Z Entering 'third_party/protobuf' 2025-12-04T09:18:53.8717543Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:53.8775822Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:53.8838333Z Entering 'third_party/psimd' 
2025-12-04T09:18:53.8898658Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:53.8957788Z Entering 'third_party/pybind11' 2025-12-04T09:18:53.9017668Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:53.9078717Z Entering 'third_party/sleef' 2025-12-04T09:18:53.9138718Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:53.9196600Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:53.9254721Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:53.9312654Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:53.9370020Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:53.9423848Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:53.9507411Z ##[endgroup] 2025-12-04T09:18:53.9508243Z ##[group]Persisting credentials for submodules 2025-12-04T09:18:53.9515319Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T09:18:53.9917265Z Entering 'android/libs/fbjni' 2025-12-04T09:18:53.9998152Z Entering 'third_party/FP16' 2025-12-04T09:18:54.0080457Z Entering 'third_party/FXdiv' 2025-12-04T09:18:54.0158964Z Entering 'third_party/NNPACK' 2025-12-04T09:18:54.0241451Z Entering 'third_party/NVTX' 2025-12-04T09:18:54.0318675Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:54.0395154Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:54.0494610Z Entering 'third_party/aiter' 2025-12-04T09:18:54.0573531Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:54.0662630Z Entering 'third_party/benchmark' 2025-12-04T09:18:54.0742167Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:54.0830243Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:54.0908580Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:54.0986721Z Entering 
'third_party/cudnn_frontend' 2025-12-04T09:18:54.1067025Z Entering 'third_party/cutlass' 2025-12-04T09:18:54.1155010Z Entering 'third_party/fbgemm' 2025-12-04T09:18:54.1234034Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:54.1312124Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:54.1397117Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:54.1476525Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:54.1562632Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:54.1639682Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:54.1716105Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:54.1799122Z Entering 'third_party/flash-attention' 2025-12-04T09:18:54.1881326Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:18:54.1963997Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:54.2050884Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:54.2133825Z Entering 'third_party/fmt' 2025-12-04T09:18:54.2213921Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:54.2293841Z Entering 'third_party/gloo' 2025-12-04T09:18:54.2373819Z Entering 'third_party/googletest' 2025-12-04T09:18:54.2453037Z Entering 'third_party/ideep' 2025-12-04T09:18:54.2528546Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:54.2614578Z Entering 'third_party/ittapi' 2025-12-04T09:18:54.2692090Z Entering 'third_party/kineto' 2025-12-04T09:18:54.2769690Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:54.2847007Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:54.2926442Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:54.3003852Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:54.3081721Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 
2025-12-04T09:18:54.3160150Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:54.3245526Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:54.3323984Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:54.3401765Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:54.3485032Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:54.3563524Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:54.3640584Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:54.3721227Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:54.3812227Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:54.3887428Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:54.3967854Z Entering 'third_party/kleidiai' 2025-12-04T09:18:54.4049309Z Entering 'third_party/mimalloc' 2025-12-04T09:18:54.4129623Z Entering 'third_party/nlohmann' 2025-12-04T09:18:54.4210410Z Entering 'third_party/onnx' 2025-12-04T09:18:54.4304020Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:54.4386897Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:54.4467184Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:54.4543920Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:54.4623241Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:54.4698054Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:54.4774599Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:54.4852243Z 
Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:54.4927250Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:54.5002625Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:54.5081136Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:54.5164113Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:54.5263401Z Entering 'third_party/pocketfft' 2025-12-04T09:18:54.5347018Z Entering 'third_party/protobuf' 2025-12-04T09:18:54.5428180Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:54.5503996Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:54.5584226Z Entering 'third_party/psimd' 2025-12-04T09:18:54.5664258Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:54.5743607Z Entering 'third_party/pybind11' 2025-12-04T09:18:54.5822426Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:54.5901181Z Entering 'third_party/sleef' 2025-12-04T09:18:54.5979832Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:54.6057792Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:54.6132925Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:54.6208953Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:54.6283761Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:54.6357710Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:54.6466038Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T09:18:54.6874041Z Entering 'android/libs/fbjni' 2025-12-04T09:18:54.6948184Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T09:18:54.6972827Z Entering 'third_party/FP16' 2025-12-04T09:18:54.7044812Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T09:18:54.7069998Z Entering 'third_party/FXdiv' 2025-12-04T09:18:54.7146188Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T09:18:54.7174799Z Entering 'third_party/NNPACK' 2025-12-04T09:18:54.7246945Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T09:18:54.7271931Z Entering 'third_party/NVTX' 2025-12-04T09:18:54.7343615Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T09:18:54.7369620Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:54.7441771Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T09:18:54.7467251Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:54.7539201Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T09:18:54.7579807Z Entering 'third_party/aiter' 2025-12-04T09:18:54.7651174Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T09:18:54.7677672Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:54.7749967Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T09:18:54.7785543Z Entering 'third_party/benchmark' 2025-12-04T09:18:54.7857995Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:18:54.7884770Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:54.7957977Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T09:18:54.7990838Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:54.8062994Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T09:18:54.8088744Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:54.8158983Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T09:18:54.8185821Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:18:54.8260314Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T09:18:54.8284416Z Entering 'third_party/cutlass' 2025-12-04T09:18:54.8357638Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T09:18:54.8394301Z Entering 'third_party/fbgemm' 2025-12-04T09:18:54.8467506Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T09:18:54.8491636Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:54.8563233Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T09:18:54.8588631Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:54.8659312Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T09:18:54.8692072Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:54.8763868Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T09:18:54.8788063Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:54.8867073Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T09:18:54.8902111Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:54.8973461Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T09:18:54.8997228Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:54.9068924Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T09:18:54.9092772Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:54.9163823Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T09:18:54.9193058Z Entering 'third_party/flash-attention' 2025-12-04T09:18:54.9266086Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T09:18:54.9289750Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:18:54.9361655Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T09:18:54.9391875Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:54.9467955Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T09:18:54.9505025Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:54.9579266Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T09:18:54.9609516Z Entering 'third_party/fmt' 2025-12-04T09:18:54.9680901Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:18:54.9707556Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:54.9780013Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T09:18:54.9806662Z Entering 'third_party/gloo' 2025-12-04T09:18:54.9880651Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T09:18:54.9905771Z Entering 'third_party/googletest' 2025-12-04T09:18:54.9977965Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:18:55.0003411Z Entering 'third_party/ideep' 2025-12-04T09:18:55.0074475Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T09:18:55.0096798Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:55.0167440Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T09:18:55.0203902Z Entering 'third_party/ittapi' 2025-12-04T09:18:55.0276919Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T09:18:55.0301557Z Entering 'third_party/kineto' 2025-12-04T09:18:55.0373288Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T09:18:55.0396317Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:55.0469434Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T09:18:55.0491154Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:55.0562712Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T09:18:55.0588489Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:55.0662262Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T09:18:55.0686661Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:55.0758802Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:18:55.0781627Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:55.0854305Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T09:18:55.0877196Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:55.0949617Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T09:18:55.0977681Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:55.1050725Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T09:18:55.1075415Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:55.1149167Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:18:55.1172345Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:55.1243824Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T09:18:55.1269002Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:55.1343122Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T09:18:55.1367951Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:55.1439170Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:18:55.1461141Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:55.1535618Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:18:55.1562114Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:55.1636646Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:18:55.1669408Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:55.1741207Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T09:18:55.1764474Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:55.1835221Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T09:18:55.1862738Z Entering 'third_party/kleidiai' 2025-12-04T09:18:55.1936396Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T09:18:55.1962978Z Entering 'third_party/mimalloc' 2025-12-04T09:18:55.2035602Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T09:18:55.2060185Z Entering 'third_party/nlohmann' 2025-12-04T09:18:55.2132665Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T09:18:55.2159740Z Entering 'third_party/onnx' 2025-12-04T09:18:55.2231109Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T09:18:55.2271284Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:55.2343087Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:18:55.2373959Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:55.2446116Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T09:18:55.2472122Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:55.2543056Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:18:55.2567604Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:55.2638567Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:18:55.2662601Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:55.2733133Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T09:18:55.2756829Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:55.2826894Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T09:18:55.2853225Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:55.2924658Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T09:18:55.2949052Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:55.3021814Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T09:18:55.3045275Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:55.3116352Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:18:55.3138289Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:55.3210537Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:18:55.3236641Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:55.3309165Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:18:55.3337552Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:55.3409685Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T09:18:55.3456312Z Entering 'third_party/pocketfft' 2025-12-04T09:18:55.3531706Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T09:18:55.3556013Z Entering 'third_party/protobuf' 2025-12-04T09:18:55.3628548Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T09:18:55.3656733Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:55.3728720Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:18:55.3752857Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:55.3826542Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 
2025-12-04T09:18:55.3856077Z Entering 'third_party/psimd' 2025-12-04T09:18:55.3929104Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T09:18:55.3953627Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:55.4028855Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T09:18:55.4052568Z Entering 'third_party/pybind11' 2025-12-04T09:18:55.4124603Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:18:55.4149467Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:55.4226462Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T09:18:55.4253455Z Entering 'third_party/sleef' 2025-12-04T09:18:55.4327752Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T09:18:55.4353111Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:55.4424878Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T09:18:55.4449357Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:55.4518690Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:18:55.4542945Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:55.4618190Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T09:18:55.4641382Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:55.4716799Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T09:18:55.4740431Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:55.4810713Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:18:55.4832219Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:55.4904009Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T09:18:55.5700928Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T09:18:55.6100212Z Entering 'android/libs/fbjni' 2025-12-04T09:18:55.6162317Z Entering 'third_party/FP16' 2025-12-04T09:18:55.6222750Z Entering 'third_party/FXdiv' 2025-12-04T09:18:55.6283069Z Entering 'third_party/NNPACK' 2025-12-04T09:18:55.6344169Z Entering 'third_party/NVTX' 2025-12-04T09:18:55.6406555Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:55.6467270Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:55.6541719Z Entering 'third_party/aiter' 2025-12-04T09:18:55.6601920Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:55.6670743Z Entering 'third_party/benchmark' 2025-12-04T09:18:55.6732382Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:55.6801344Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:55.6860299Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:55.6922639Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:18:55.6983140Z Entering 'third_party/cutlass' 2025-12-04T09:18:55.7053105Z Entering 'third_party/fbgemm' 2025-12-04T09:18:55.7114670Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:55.7172352Z Entering 
'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:55.7239180Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:55.7299043Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:55.7365437Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:55.7424628Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:55.7482178Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:55.7547225Z Entering 'third_party/flash-attention' 2025-12-04T09:18:55.7607092Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:18:55.7671683Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:55.7741231Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:55.7805194Z Entering 'third_party/fmt' 2025-12-04T09:18:55.7864254Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:55.7927259Z Entering 'third_party/gloo' 2025-12-04T09:18:55.7988066Z Entering 'third_party/googletest' 2025-12-04T09:18:55.8050644Z Entering 'third_party/ideep' 2025-12-04T09:18:55.8108284Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:55.8175046Z Entering 'third_party/ittapi' 2025-12-04T09:18:55.8236449Z Entering 'third_party/kineto' 2025-12-04T09:18:55.8295100Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:55.8355622Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:55.8419211Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:55.8477783Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:55.8535910Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:55.8591800Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:55.8656087Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:55.8713974Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:55.8777052Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:55.8837248Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:55.8895010Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:55.8951752Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:55.9015558Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:55.9086511Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:55.9143511Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:55.9204649Z Entering 'third_party/kleidiai' 2025-12-04T09:18:55.9266781Z Entering 'third_party/mimalloc' 2025-12-04T09:18:55.9327769Z Entering 'third_party/nlohmann' 2025-12-04T09:18:55.9389093Z Entering 'third_party/onnx' 2025-12-04T09:18:55.9465625Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:55.9538650Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:55.9598745Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:55.9656770Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:55.9713999Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:55.9772613Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:55.9830333Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:55.9886736Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:55.9944192Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:55.9999120Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:56.0058058Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:56.0119843Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:56.0201134Z Entering 'third_party/pocketfft' 2025-12-04T09:18:56.0264808Z Entering 'third_party/protobuf' 2025-12-04T09:18:56.0327261Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:56.0384004Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:56.0445523Z Entering 'third_party/psimd' 2025-12-04T09:18:56.0504802Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:56.0567598Z Entering 'third_party/pybind11' 2025-12-04T09:18:56.0627789Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:56.0691724Z Entering 'third_party/sleef' 2025-12-04T09:18:56.0752803Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:56.0816542Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:56.0872235Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:56.0930409Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:56.0987442Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:56.1050004Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:56.1134283Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T09:18:56.1541934Z Entering 'android/libs/fbjni' 2025-12-04T09:18:56.1603732Z Entering 'third_party/FP16' 2025-12-04T09:18:56.1664328Z Entering 'third_party/FXdiv' 2025-12-04T09:18:56.1726759Z Entering 'third_party/NNPACK' 2025-12-04T09:18:56.1787889Z Entering 'third_party/NVTX' 2025-12-04T09:18:56.1850562Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:56.1912588Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:56.1988025Z Entering 
'third_party/aiter' 2025-12-04T09:18:56.2050718Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:56.2122193Z Entering 'third_party/benchmark' 2025-12-04T09:18:56.2183415Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:56.2257297Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:56.2320490Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:56.2380935Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:18:56.2442932Z Entering 'third_party/cutlass' 2025-12-04T09:18:56.2513401Z Entering 'third_party/fbgemm' 2025-12-04T09:18:56.2576775Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:56.2636413Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:56.2710779Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:56.2768732Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:56.2836226Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:56.2895712Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:56.2955675Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:56.3018521Z Entering 'third_party/flash-attention' 2025-12-04T09:18:56.3077763Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:18:56.3145102Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:56.3220421Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:56.3284475Z Entering 'third_party/fmt' 2025-12-04T09:18:56.3345412Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:56.3406995Z Entering 'third_party/gloo' 2025-12-04T09:18:56.3469330Z Entering 'third_party/googletest' 2025-12-04T09:18:56.3532637Z Entering 'third_party/ideep' 2025-12-04T09:18:56.3590692Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:56.3658175Z Entering 'third_party/ittapi' 2025-12-04T09:18:56.3720539Z Entering 'third_party/kineto' 2025-12-04T09:18:56.3781157Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 
2025-12-04T09:18:56.3838884Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:56.3897548Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:56.3956457Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:56.4014519Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:56.4071811Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:56.4134144Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:56.4191829Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:56.4251907Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:56.4309977Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:56.4372845Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:56.4429579Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:56.4490232Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:56.4563967Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:56.4614162Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:56.4674348Z Entering 'third_party/kleidiai' 2025-12-04T09:18:56.4736068Z Entering 'third_party/mimalloc' 2025-12-04T09:18:56.4795386Z Entering 'third_party/nlohmann' 2025-12-04T09:18:56.4857625Z Entering 'third_party/onnx' 2025-12-04T09:18:56.4933710Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:56.4996745Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:56.5058255Z Entering 
'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:56.5119083Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:56.5176589Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:56.5233760Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:56.5295183Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:56.5352080Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:56.5409655Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:56.5464777Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:56.5524907Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:56.5585285Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:56.5666489Z Entering 'third_party/pocketfft' 2025-12-04T09:18:56.5727495Z Entering 'third_party/protobuf' 2025-12-04T09:18:56.5790550Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:56.5854033Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:56.5915645Z Entering 'third_party/psimd' 2025-12-04T09:18:56.5977786Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:56.6042566Z Entering 'third_party/pybind11' 2025-12-04T09:18:56.6101765Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:56.6161830Z Entering 'third_party/sleef' 2025-12-04T09:18:56.6221282Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:56.6280477Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:56.6337631Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:56.6398596Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:56.6457402Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:56.6514746Z Entering 
'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:56.6596358Z ##[endgroup] 2025-12-04T09:18:56.6645111Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T09:18:56.6677499Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:18:56.6810268Z ##[group]Run cd "${GITHUB_WORKSPACE}" 2025-12-04T09:18:56.6810624Z cd "${GITHUB_WORKSPACE}" 2025-12-04T09:18:56.6810920Z # Clean stale submodule dirs 2025-12-04T09:18:56.6811226Z if [ -z "${NO_SUDO}" ]; then 2025-12-04T09:18:56.6811597Z  sudo git submodule foreach --recursive git clean -ffdx 2025-12-04T09:18:56.6811953Z else 2025-12-04T09:18:56.6812246Z  git submodule foreach --recursive git clean -ffdx 2025-12-04T09:18:56.6812590Z fi 2025-12-04T09:18:56.6824163Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:18:56.6824524Z env: 2025-12-04T09:18:56.6824742Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:56.6824984Z NO_SUDO: true 2025-12-04T09:18:56.6825230Z ##[endgroup] 2025-12-04T09:18:56.7251841Z Entering 'android/libs/fbjni' 2025-12-04T09:18:56.7299146Z Entering 'third_party/FP16' 2025-12-04T09:18:56.7346284Z Entering 'third_party/FXdiv' 2025-12-04T09:18:56.7391035Z Entering 'third_party/NNPACK' 2025-12-04T09:18:56.7442654Z Entering 'third_party/NVTX' 2025-12-04T09:18:56.7497278Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:18:56.7544489Z Entering 'third_party/XNNPACK' 2025-12-04T09:18:56.7702874Z Entering 'third_party/aiter' 2025-12-04T09:18:56.7760943Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:18:56.7909789Z Entering 'third_party/benchmark' 2025-12-04T09:18:56.7957365Z Entering 'third_party/composable_kernel' 2025-12-04T09:18:56.8116987Z Entering 'third_party/cpp-httplib' 2025-12-04T09:18:56.8164961Z Entering 'third_party/cpuinfo' 2025-12-04T09:18:56.8217531Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:18:56.8268669Z Entering 'third_party/cutlass' 2025-12-04T09:18:56.8403261Z Entering 'third_party/fbgemm' 
2025-12-04T09:18:56.8485333Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:18:56.8530609Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:18:56.8691940Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:18:56.8741252Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:18:56.8872862Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:18:56.8919926Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:18:56.8962140Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:18:56.9026006Z Entering 'third_party/flash-attention' 2025-12-04T09:18:56.9081479Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:18:56.9214673Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:18:56.9332489Z Entering 'third_party/flatbuffers' 2025-12-04T09:18:56.9431863Z Entering 'third_party/fmt' 2025-12-04T09:18:56.9479248Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:18:56.9527279Z Entering 'third_party/gloo' 2025-12-04T09:18:56.9575249Z Entering 'third_party/googletest' 2025-12-04T09:18:56.9624344Z Entering 'third_party/ideep' 2025-12-04T09:18:56.9667370Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:18:56.9781816Z Entering 'third_party/ittapi' 2025-12-04T09:18:56.9831674Z Entering 'third_party/kineto' 2025-12-04T09:18:56.9881111Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:18:56.9933120Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:18:56.9997400Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:18:57.0046312Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:18:57.0091948Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:18:57.0134610Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:18:57.0181006Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:18:57.0227060Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:18:57.0278206Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:18:57.0334989Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:18:57.0378873Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:18:57.0423625Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:57.0494116Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:57.0551860Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:18:57.0595618Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:18:57.0645145Z Entering 'third_party/kleidiai' 2025-12-04T09:18:57.0703753Z Entering 'third_party/mimalloc' 2025-12-04T09:18:57.0751498Z Entering 'third_party/nlohmann' 2025-12-04T09:18:57.0815474Z Entering 'third_party/onnx' 2025-12-04T09:18:57.1285183Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:18:57.1338092Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:18:57.1419250Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:18:57.1462849Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:18:57.1514520Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:18:57.1556166Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:18:57.1614823Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:18:57.1661779Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:18:57.1706585Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:18:57.1749491Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:18:57.1814442Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:18:57.1864052Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:18:57.2224267Z Entering 'third_party/pocketfft' 2025-12-04T09:18:57.2267506Z Entering 'third_party/protobuf' 2025-12-04T09:18:57.2375524Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:18:57.2420866Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:18:57.2474414Z Entering 'third_party/psimd' 2025-12-04T09:18:57.2519384Z Entering 'third_party/pthreadpool' 2025-12-04T09:18:57.2563901Z Entering 'third_party/pybind11' 2025-12-04T09:18:57.2613729Z Entering 'third_party/python-peachpy' 2025-12-04T09:18:57.2660380Z Entering 'third_party/sleef' 2025-12-04T09:18:57.2710427Z Entering 'third_party/tensorpipe' 2025-12-04T09:18:57.2758716Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:18:57.2804807Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:18:57.2848077Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:18:57.2899380Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:18:57.2941419Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:18:57.3115220Z Prepare all required actions 2025-12-04T09:18:57.3115927Z Getting action download info 2025-12-04T09:18:57.4559002Z ##[group]Run ./.github/actions/setup-linux 2025-12-04T09:18:57.4559292Z env: 2025-12-04T09:18:57.4559582Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:57.4559828Z ##[endgroup] 2025-12-04T09:18:57.4596427Z ##[group]Run set -euo pipefail 2025-12-04T09:18:57.4596747Z set -euo pipefail 2025-12-04T09:18:57.4597028Z function get_ec2_metadata() { 2025-12-04T09:18:57.4597390Z  # Pulled from instance 
metadata endpoint for EC2 2025-12-04T09:18:57.4597979Z  # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 2025-12-04T09:18:57.4598915Z  category=$1 2025-12-04T09:18:57.4599543Z  # If it is GCP runner (runner name contains gcp), do not run this 2025-12-04T09:18:57.4599951Z  runner_name_str=i-0513695dee1ce902e 2025-12-04T09:18:57.4600500Z  if [[ -f /.inarc ]]; then 2025-12-04T09:18:57.4601213Z  echo "ARC Runner, no info on ec2 metadata" 2025-12-04T09:18:57.4601703Z  elif [[ $runner_name_str == *"gcp"* ]]; then 2025-12-04T09:18:57.4602284Z  echo "Runner is from Google Cloud Platform, No info on ec2 metadata" 2025-12-04T09:18:57.4602689Z  else 2025-12-04T09:18:57.4603485Z  curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}" 2025-12-04T09:18:57.4604315Z  fi 2025-12-04T09:18:57.4604520Z } 2025-12-04T09:18:57.4604772Z echo "ami-id: $(get_ec2_metadata ami-id)" 2025-12-04T09:18:57.4605172Z echo "instance-id: $(get_ec2_metadata instance-id)" 2025-12-04T09:18:57.4605625Z echo "instance-type: $(get_ec2_metadata instance-type)" 2025-12-04T09:18:57.4606018Z echo "system info $(uname -a)" 2025-12-04T09:18:57.4616855Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:18:57.4617223Z env: 2025-12-04T09:18:57.4617438Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:57.4617681Z ##[endgroup] 2025-12-04T09:18:57.4787331Z ami-id: ami-08982f1c5bf93d976 2025-12-04T09:18:57.4919398Z instance-id: i-0513695dee1ce902e 2025-12-04T09:18:57.5044378Z instance-type: g5.4xlarge 2025-12-04T09:18:57.5059718Z system info Linux ip-10-0-37-220.ec2.internal 6.1.150-174.273.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Sep 9 12:21:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-12-04T09:18:57.5083273Z ##[group]Run if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T09:18:57.5083745Z if [ -f 
/usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T09:18:57.5093590Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:18:57.5093948Z env: 2025-12-04T09:18:57.5094160Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:57.5094416Z ##[endgroup] 2025-12-04T09:18:59.1173440Z Thu Dec 4 09:18:59 2025 2025-12-04T09:18:59.1173849Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:18:59.1174352Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:18:59.1174850Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:18:59.1175362Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:18:59.1175945Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:18:59.1176385Z | | | MIG M. | 2025-12-04T09:18:59.1176737Z |=========================================+========================+======================| 2025-12-04T09:18:59.1271113Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 | 2025-12-04T09:18:59.1271970Z | 0% 26C P0 52W / 300W | 0MiB / 23028MiB | 3% Default | 2025-12-04T09:18:59.1272360Z | | | N/A | 2025-12-04T09:18:59.1272756Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:18:59.1273041Z 2025-12-04T09:18:59.1273275Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:18:59.1273712Z | Processes: | 2025-12-04T09:18:59.1274157Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:18:59.1274579Z | ID ID Usage | 2025-12-04T09:18:59.1275091Z |=========================================================================================| 2025-12-04T09:18:59.1276905Z | No running processes found | 2025-12-04T09:18:59.1277385Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:18:59.5520864Z ##[group]Run 
echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:18:59.5521727Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:18:59.5535173Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:18:59.5535550Z env: 2025-12-04T09:18:59.5535804Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:59.5536052Z ##[endgroup] 2025-12-04T09:18:59.5598870Z ##[group]Run if systemctl is-active --quiet docker; then 2025-12-04T09:18:59.5599285Z if systemctl is-active --quiet docker; then 2025-12-04T09:18:59.5599765Z  echo "Docker daemon is running..."; 2025-12-04T09:18:59.5600090Z else 2025-12-04T09:18:59.5600774Z  echo "Starting docker daemon..." && sudo systemctl start docker; 2025-12-04T09:18:59.5601179Z fi 2025-12-04T09:18:59.5610070Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:18:59.5610426Z env: 2025-12-04T09:18:59.5610628Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:59.5610878Z ##[endgroup] 2025-12-04T09:18:59.5720470Z Docker daemon is running... 2025-12-04T09:18:59.5763597Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:18:59.5763898Z with: 2025-12-04T09:18:59.5764093Z shell: bash 2025-12-04T09:18:59.5764321Z timeout_minutes: 5 2025-12-04T09:18:59.5764618Z max_attempts: 3 2025-12-04T09:18:59.5764845Z retry_wait_seconds: 30 2025-12-04T09:18:59.5766953Z command: AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" # For LF Runners we need to make sure we also login to Meta's ECR docker registry too. 
META_AWS_ACCOUNT_ID=308535385114 if [ "$AWS_ACCOUNT_ID" != "$META_AWS_ACCOUNT_ID" ] ; then aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$META_AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" fi 2025-12-04T09:18:59.5769063Z polling_interval_seconds: 1 2025-12-04T09:18:59.5769349Z warning_on_retry: true 2025-12-04T09:18:59.5769604Z continue_on_error: false 2025-12-04T09:18:59.5769855Z env: 2025-12-04T09:18:59.5770071Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:18:59.5770331Z AWS_RETRY_MODE: standard 2025-12-04T09:18:59.5770581Z AWS_MAX_ATTEMPTS: 5 2025-12-04T09:18:59.5770836Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:18:59.5771111Z ##[endgroup] 2025-12-04T09:19:00.7331004Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:19:00.7332230Z Configure a credential helper to remove this warning. See 2025-12-04T09:19:00.7332783Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:19:00.7333161Z 2025-12-04T09:19:00.7333253Z Login Succeeded 2025-12-04T09:19:01.6631654Z Command completed after 1 attempt(s). 
2025-12-04T09:19:01.6703550Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:19:01.6704047Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:19:01.6704491Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:19:01.6714460Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:01.6714811Z env: 2025-12-04T09:19:01.6715024Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:01.6715275Z ##[endgroup] 2025-12-04T09:19:01.6845776Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:19:01.6846304Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:19:01.6846725Z # shellcheck disable=SC2046 2025-12-04T09:19:01.6847049Z docker stop $(docker ps -q) || true 2025-12-04T09:19:01.6847376Z # Prune all of the docker images 2025-12-04T09:19:01.6847685Z docker system prune -af 2025-12-04T09:19:01.6857404Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:01.6857757Z env: 2025-12-04T09:19:01.6857961Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:01.6858210Z ##[endgroup] 2025-12-04T09:19:01.7187846Z "docker stop" requires at least 1 argument. 2025-12-04T09:19:01.7188293Z See 'docker stop --help'. 2025-12-04T09:19:01.7188464Z 2025-12-04T09:19:01.7188634Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 
2025-12-04T09:19:01.7188891Z 2025-12-04T09:19:01.7188999Z Stop one or more running containers 2025-12-04T09:19:01.7508949Z Total reclaimed space: 0B 2025-12-04T09:19:01.7708796Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2025-12-04T09:19:01.7709239Z with: 2025-12-04T09:19:01.7709988Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7710829Z use-custom-docker-registry: true 2025-12-04T09:19:01.7711126Z docker-build-dir: .ci/docker 2025-12-04T09:19:01.7711418Z docker-build-script: ./build.sh 2025-12-04T09:19:01.7711702Z working-directory: . 2025-12-04T09:19:01.7712033Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:01.7712412Z force-push: false 2025-12-04T09:19:01.7712628Z env: 2025-12-04T09:19:01.7712828Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:01.7713072Z ##[endgroup] 2025-12-04T09:19:01.7731899Z ##[group]Run set -ex 2025-12-04T09:19:01.7732183Z set -ex 2025-12-04T09:19:01.7732421Z  2025-12-04T09:19:01.7732827Z # If the docker build directory or the build script doesn't exist, the action will 2025-12-04T09:19:01.7733457Z # gracefully return the docker image name as it is. Pulling docker image in Linux 2025-12-04T09:19:01.7733996Z # job could then download the pre-built image as usual 2025-12-04T09:19:01.7734636Z if [[ -d "${DOCKER_BUILD_DIR}" ]] && [[ -f "${DOCKER_BUILD_DIR}/${DOCKER_BUILD_SCRIPT}" ]] && [[ "${USE_CUSTOM_DOCKER_REGISTRY}" == "true" ]]; then 2025-12-04T09:19:01.7735234Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7735553Z else 2025-12-04T09:19:01.7735809Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7736237Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7736632Z  2025-12-04T09:19:01.7737205Z  echo "Not using custom ECR registry. 
Either it was not requested or there is no Docker build script in the ${REPO_NAME} repo..." 2025-12-04T09:19:01.7737800Z  exit 0 2025-12-04T09:19:01.7738015Z fi 2025-12-04T09:19:01.7738219Z  2025-12-04T09:19:01.7738546Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2025-12-04T09:19:01.7739114Z  # The docker image name already includes the ECR prefix and tag, so we can just 2025-12-04T09:19:01.7739614Z  # use it as it is, but first let's extract the tag 2025-12-04T09:19:01.7740075Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2025-12-04T09:19:01.7740555Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7741021Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7741398Z else 2025-12-04T09:19:01.7741657Z  if [[ "${DOCKER_IMAGE_NAME}" == *:* ]]; then 2025-12-04T09:19:01.7742023Z  CUSTOM_TAG_PREFIX=${DOCKER_IMAGE_NAME#*:} 2025-12-04T09:19:01.7742582Z  DOCKER_IMAGE_NAME=${DOCKER_IMAGE_NAME%%:*} 2025-12-04T09:19:01.7742904Z  fi 2025-12-04T09:19:01.7743336Z  DOCKER_TAG=${CUSTOM_TAG_PREFIX:+${CUSTOM_TAG_PREFIX}-}$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2025-12-04T09:19:01.7743916Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7744514Z  echo "docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7745168Z  echo "custom-tag-prefix=${CUSTOM_TAG_PREFIX}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7745575Z fi 2025-12-04T09:19:01.7754611Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:01.7754979Z env: 2025-12-04T09:19:01.7755196Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:01.7755455Z REPO_NAME: pytorch 2025-12-04T09:19:01.7756403Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7757252Z 
DOCKER_BUILD_DIR: .ci/docker 2025-12-04T09:19:01.7757533Z DOCKER_BUILD_SCRIPT: ./build.sh 2025-12-04T09:19:01.7757894Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:01.7758289Z USE_CUSTOM_DOCKER_REGISTRY: true 2025-12-04T09:19:01.7758579Z CUSTOM_TAG_PREFIX: 2025-12-04T09:19:01.7758815Z ##[endgroup] 2025-12-04T09:19:01.7790270Z + [[ -d .ci/docker ]] 2025-12-04T09:19:01.7790710Z + [[ -f .ci/docker/./build.sh ]] 2025-12-04T09:19:01.7791135Z + [[ true == \t\r\u\e ]] 2025-12-04T09:19:01.7791514Z + echo skip=false 2025-12-04T09:19:01.7792916Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2025-12-04T09:19:01.7800143Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7801554Z ++ awk -F '[:,]' '{print $2}' 2025-12-04T09:19:01.7830461Z + DOCKER_TAG=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7831559Z + echo docker-tag=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7832672Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7856360Z ##[group]Run set +e 2025-12-04T09:19:01.7856661Z set +e 2025-12-04T09:19:01.7856877Z set -x 2025-12-04T09:19:01.7857086Z  2025-12-04T09:19:01.7857292Z login() { 2025-12-04T09:19:01.7857730Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T09:19:01.7858234Z } 2025-12-04T09:19:01.7858449Z  2025-12-04T09:19:01.7858645Z retry () { 2025-12-04T09:19:01.7858903Z  $* || 
(sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T09:19:01.7859200Z } 2025-12-04T09:19:01.7859398Z  2025-12-04T09:19:01.7859621Z retry login "${DOCKER_REGISTRY}" 2025-12-04T09:19:01.7859911Z  2025-12-04T09:19:01.7860124Z START_TIME=$(date +%s) 2025-12-04T09:19:01.7860399Z # Wait up to 120 minutes 2025-12-04T09:19:01.7860744Z while [[ $(( $(date +%s) - 7200 )) -lt $START_TIME ]]; do 2025-12-04T09:19:01.7861205Z  # Check if image already exists, if it does then skip building it 2025-12-04T09:19:01.7861661Z  if docker manifest inspect "${DOCKER_IMAGE}"; then 2025-12-04T09:19:01.7862007Z  exit 0 2025-12-04T09:19:01.7862233Z  fi 2025-12-04T09:19:01.7862433Z  2025-12-04T09:19:01.7862976Z  # NB: This flag is used by Docker build workflow to push the image to ECR, so we can 2025-12-04T09:19:01.7863602Z  # use this to differentiate between the Docker build and regular build jobs. For the 2025-12-04T09:19:01.7864223Z  # latter, it will wait for the Docker images to become available before continuing 2025-12-04T09:19:01.7864708Z  if [ "${DOCKER_PUSH:-false}" == "true" ]; then 2025-12-04T09:19:01.7865091Z  # It's a Docker build job, let's build the image 2025-12-04T09:19:01.7865419Z  break 2025-12-04T09:19:01.7865640Z  else 2025-12-04T09:19:01.7865958Z  # It's a regular build job, wait for the image to become available 2025-12-04T09:19:01.7866349Z  sleep 300 2025-12-04T09:19:01.7866584Z  fi 2025-12-04T09:19:01.7866789Z done 2025-12-04T09:19:01.7866991Z  2025-12-04T09:19:01.7867318Z # NB: This part requires a full checkout. Otherwise, the merge base will 2025-12-04T09:19:01.7868002Z # be empty. 
The default action would be to continue rebuild the image 2025-12-04T09:19:01.7868486Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2025-12-04T09:19:01.7868919Z  # if we're on the base branch then use the parent commit 2025-12-04T09:19:01.7869297Z  MERGE_BASE=$(git rev-parse HEAD~) 2025-12-04T09:19:01.7869584Z else 2025-12-04T09:19:01.7869888Z  # otherwise we're on a PR, so use the most recent base commit 2025-12-04T09:19:01.7870328Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2025-12-04T09:19:01.7870661Z fi 2025-12-04T09:19:01.7870860Z  2025-12-04T09:19:01.7871081Z if [[ -z "${MERGE_BASE}" ]]; then 2025-12-04T09:19:01.7871414Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T09:19:01.7871718Z  2025-12-04T09:19:01.7872151Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2025-12-04T09:19:01.7872662Z  exit 0 2025-12-04T09:19:01.7872867Z fi 2025-12-04T09:19:01.7873067Z  2025-12-04T09:19:01.7873354Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2025-12-04T09:19:01.7873994Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2025-12-04T09:19:01.7874531Z  exit 1 2025-12-04T09:19:01.7874744Z fi 2025-12-04T09:19:01.7874939Z  2025-12-04T09:19:01.7875271Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2025-12-04T09:19:01.7875879Z # If no image exists but the hash is the same as the previous hash then we should error out here 2025-12-04T09:19:01.7876438Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2025-12-04T09:19:01.7877111Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2025-12-04T09:19:01.7877830Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2025-12-04T09:19:01.7878257Z fi 2025-12-04T09:19:01.7878459Z  2025-12-04T09:19:01.7878700Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 
2025-12-04T09:19:01.7887181Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:01.7887531Z env: 2025-12-04T09:19:01.7887744Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:01.7887998Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T09:19:01.7888324Z BASE_REVISION: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:19:01.7889193Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7890249Z DOCKER_TAG: pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:01.7890985Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:01.7891363Z DOCKER_PUSH: 2025-12-04T09:19:01.7891582Z ##[endgroup] 2025-12-04T09:19:01.7921467Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:01.7921910Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:01.7924991Z + aws ecr get-login-password --region us-east-1 2025-12-04T09:19:01.7926192Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:02.3192945Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:19:02.3193531Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:19:02.3194071Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:19:02.3194440Z 2025-12-04T09:19:02.3194578Z Login Succeeded 2025-12-04T09:19:02.3222787Z ++ date +%s 2025-12-04T09:19:02.3237011Z + START_TIME=1764839942 2025-12-04T09:19:02.3240733Z ++ date +%s 2025-12-04T09:19:02.3253723Z + [[ 1764832742 -lt 1764839942 ]] 2025-12-04T09:19:02.3254658Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:02.5587015Z { 2025-12-04T09:19:02.5587352Z "schemaVersion": 2, 2025-12-04T09:19:02.5587847Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2025-12-04T09:19:02.5588266Z "config": { 2025-12-04T09:19:02.5588590Z "mediaType": "application/vnd.docker.container.image.v1+json", 2025-12-04T09:19:02.5588972Z "size": 34864, 2025-12-04T09:19:02.5589352Z "digest": "sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301" 2025-12-04T09:19:02.5589784Z }, 2025-12-04T09:19:02.5589970Z "layers": [ 2025-12-04T09:19:02.5590159Z { 2025-12-04T09:19:02.5590472Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5590865Z "size": 30447951, 2025-12-04T09:19:02.5591277Z "digest": "sha256:63e5bc7682b85ae57a1221210f64d62e7a90b0a30f19af4ca734b8242ae49d63" 2025-12-04T09:19:02.5591714Z }, 2025-12-04T09:19:02.5591894Z { 2025-12-04T09:19:02.5592196Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5592587Z "size": 1554, 2025-12-04T09:19:02.5592963Z "digest": "sha256:0678d56345c994444b77bb70b1177189d23e794748b1d75ffc45d227c7dea94a" 2025-12-04T09:19:02.5593387Z }, 2025-12-04T09:19:02.5593561Z { 2025-12-04T09:19:02.5593867Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5594269Z "size": 313275661, 2025-12-04T09:19:02.5594670Z "digest": 
"sha256:45f5c9ddfce78349dff3d5edfbaa0310ae17311f66abdcd7e00fa21b500e801c" 2025-12-04T09:19:02.5595116Z }, 2025-12-04T09:19:02.5595306Z { 2025-12-04T09:19:02.5595608Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5595992Z "size": 787, 2025-12-04T09:19:02.5596391Z "digest": "sha256:086b1df51ac1162d9c45698e9dfaf91c6c222c8bd9ab01797ac8f9344bc8044f" 2025-12-04T09:19:02.5596833Z }, 2025-12-04T09:19:02.5597014Z { 2025-12-04T09:19:02.5597321Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5597701Z "size": 106, 2025-12-04T09:19:02.5609978Z "digest": "sha256:fe8a7b64bf98352f89057bcba66beef2fb44cc05fbd3606abccd8e86cf476234" 2025-12-04T09:19:02.5610587Z }, 2025-12-04T09:19:02.5610777Z { 2025-12-04T09:19:02.5611098Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5611488Z "size": 703, 2025-12-04T09:19:02.5611874Z "digest": "sha256:7680723e9a578033dd106b45784c639f06cc8adb1f5239ec513d9de01087c1af" 2025-12-04T09:19:02.5612379Z }, 2025-12-04T09:19:02.5612566Z { 2025-12-04T09:19:02.5612872Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5613266Z "size": 1216, 2025-12-04T09:19:02.5613649Z "digest": "sha256:9c5027aeeb4e3101f48c1d2e400c387110e1009e42497ee801f1b4b7f7edb5c0" 2025-12-04T09:19:02.5614346Z }, 2025-12-04T09:19:02.5614536Z { 2025-12-04T09:19:02.5614848Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5615231Z "size": 483, 2025-12-04T09:19:02.5615603Z "digest": "sha256:9a56521103600bd37a1e7c1191b5136c2d738c092f8a6701499f7068a32c2628" 2025-12-04T09:19:02.5616030Z }, 2025-12-04T09:19:02.5616207Z { 2025-12-04T09:19:02.5616520Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5616934Z "size": 110361875, 2025-12-04T09:19:02.5617323Z "digest": "sha256:375c4427e9141269458333b1463fdb219e736fd6231ec1c56c625c48437ace77" 2025-12-04T09:19:02.5617743Z }, 
2025-12-04T09:19:02.5617926Z { 2025-12-04T09:19:02.5618236Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5618615Z "size": 4961, 2025-12-04T09:19:02.5619004Z "digest": "sha256:a86faaa7dbdd70e678e5ea20072637ee42618921ca8f80ca089f789325d4b0c2" 2025-12-04T09:19:02.5619452Z }, 2025-12-04T09:19:02.5619673Z { 2025-12-04T09:19:02.5620200Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5620591Z "size": 1755, 2025-12-04T09:19:02.5620974Z "digest": "sha256:fb7848686804957915d98f8655ef6da0fe4c521b50a82aefdebf475983505a15" 2025-12-04T09:19:02.5621406Z }, 2025-12-04T09:19:02.5621588Z { 2025-12-04T09:19:02.5621898Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5622275Z "size": 724, 2025-12-04T09:19:02.5622651Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:19:02.5623079Z }, 2025-12-04T09:19:02.5623261Z { 2025-12-04T09:19:02.5623563Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5623946Z "size": 543, 2025-12-04T09:19:02.5624325Z "digest": "sha256:79dc80f426b29d4ae9157b967050b03e66aa0c4b1295b944a1dd70106be87066" 2025-12-04T09:19:02.5624749Z }, 2025-12-04T09:19:02.5624939Z { 2025-12-04T09:19:02.5625254Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5625637Z "size": 3185190117, 2025-12-04T09:19:02.5626049Z "digest": "sha256:a13fcc1b90bb9c251ebe7ef2a03c4cb3afa1c8bdafe84f5f85136773059a3735" 2025-12-04T09:19:02.5626540Z }, 2025-12-04T09:19:02.5626720Z { 2025-12-04T09:19:02.5627027Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5627420Z "size": 32, 2025-12-04T09:19:02.5627798Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5628240Z }, 2025-12-04T09:19:02.5628423Z { 2025-12-04T09:19:02.5628731Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5629112Z "size": 396, 2025-12-04T09:19:02.5629489Z "digest": "sha256:549db4d6c618ecd9534658a233e3c90508f82d8735f965c2786b2eaa078869e5" 2025-12-04T09:19:02.5629921Z }, 2025-12-04T09:19:02.5630097Z { 2025-12-04T09:19:02.5630416Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5630808Z "size": 236860, 2025-12-04T09:19:02.5631183Z "digest": "sha256:5c63528cb580001e65104f4cb0809bf0673a00f989a7db42fd6d86aa1ec27cee" 2025-12-04T09:19:02.5631614Z }, 2025-12-04T09:19:02.5631796Z { 2025-12-04T09:19:02.5632100Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5632490Z "size": 231, 2025-12-04T09:19:02.5632875Z "digest": "sha256:75bd83b989a44e4d4119a3f972891025eb0e9ce95cfbe4a0ca5cdbe7130028d6" 2025-12-04T09:19:02.5633315Z }, 2025-12-04T09:19:02.5633491Z { 2025-12-04T09:19:02.5633801Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5634192Z "size": 3043497, 2025-12-04T09:19:02.5634571Z "digest": "sha256:de6e78970f517178cb91f36cd02bd9ca7b72a08fb82a0f9007516026f258c035" 2025-12-04T09:19:02.5635006Z }, 2025-12-04T09:19:02.5635188Z { 2025-12-04T09:19:02.5635491Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5635978Z "size": 1472, 2025-12-04T09:19:02.5636386Z "digest": "sha256:e13ed7c7e4736e81dc21af755b3363eb26e4d3b2f1ca988dfe65effa47d8fa42" 2025-12-04T09:19:02.5636859Z }, 2025-12-04T09:19:02.5637039Z { 2025-12-04T09:19:02.5637349Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5637730Z "size": 481, 2025-12-04T09:19:02.5638117Z "digest": "sha256:6e2949bcb74152577a0f20c38bcb6dd80f5e68427e3e531a80e08c9ecc73a979" 2025-12-04T09:19:02.5638557Z }, 2025-12-04T09:19:02.5638736Z { 2025-12-04T09:19:02.5639035Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5639492Z "size": 202, 
2025-12-04T09:19:02.5639907Z "digest": "sha256:14d69d9aaec70287efd2fd35c4f93e43a29a4098458cc9fca1c93f02ad7356cb" 2025-12-04T09:19:02.5640343Z }, 2025-12-04T09:19:02.5640522Z { 2025-12-04T09:19:02.5640833Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5641216Z "size": 607, 2025-12-04T09:19:02.5641706Z "digest": "sha256:5c02769dd8e5bba2f7f5fd84bde9595fcb3bdbffcae497503fa846f9b5e78bf5" 2025-12-04T09:19:02.5642161Z }, 2025-12-04T09:19:02.5642337Z { 2025-12-04T09:19:02.5642643Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5643040Z "size": 7889619584, 2025-12-04T09:19:02.5643441Z "digest": "sha256:35041ce524ac4afec40ecd73b1393c830614f1f79d43a6439767a6c7d5b7027b" 2025-12-04T09:19:02.5643867Z }, 2025-12-04T09:19:02.5644045Z { 2025-12-04T09:19:02.5644358Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5644739Z "size": 830, 2025-12-04T09:19:02.5645117Z "digest": "sha256:2fa92dc5885e080e049ceb4139288b6c0e39fab34256945708b08ea55a1f7a0b" 2025-12-04T09:19:02.5645552Z }, 2025-12-04T09:19:02.5645730Z { 2025-12-04T09:19:02.5646041Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5646429Z "size": 33451739, 2025-12-04T09:19:02.5646829Z "digest": "sha256:2b85eafbd92a0e70a0a70154ad8bf4584095e576d95873368f30373f5966714a" 2025-12-04T09:19:02.5647267Z }, 2025-12-04T09:19:02.5647449Z { 2025-12-04T09:19:02.5647750Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5648142Z "size": 104, 2025-12-04T09:19:02.5648531Z "digest": "sha256:ff755a4ddad7880f23c6b767d432d6f1eafdb62b3ea18f8a98e22c441c099fcb" 2025-12-04T09:19:02.5648981Z }, 2025-12-04T09:19:02.5649157Z { 2025-12-04T09:19:02.5649470Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5649857Z "size": 1496, 2025-12-04T09:19:02.5650228Z "digest": 
"sha256:09eb41bdf42d8605b57b2363348154140904dec914b34a67298b82122bfce2b3" 2025-12-04T09:19:02.5650658Z }, 2025-12-04T09:19:02.5650836Z { 2025-12-04T09:19:02.5651133Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5651524Z "size": 458787828, 2025-12-04T09:19:02.5651913Z "digest": "sha256:11ede4d59e935e62f41b33220fe871794ab5e57ce724173b713368977683bcf6" 2025-12-04T09:19:02.5652345Z }, 2025-12-04T09:19:02.5652526Z { 2025-12-04T09:19:02.5652834Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5653213Z "size": 164, 2025-12-04T09:19:02.5653589Z "digest": "sha256:1283cd8f801a142172f3ab76fd472df8583223d9437de3e4d18d8cf98ea3fa98" 2025-12-04T09:19:02.5654020Z }, 2025-12-04T09:19:02.5654203Z { 2025-12-04T09:19:02.5654502Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5654889Z "size": 346, 2025-12-04T09:19:02.5655261Z "digest": "sha256:024fa855425fa524ad4500660cf61d53be62b99556d31b8b280d14caba434a35" 2025-12-04T09:19:02.5655684Z }, 2025-12-04T09:19:02.5655862Z { 2025-12-04T09:19:02.5656168Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5656596Z "size": 32, 2025-12-04T09:19:02.5656977Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5657499Z }, 2025-12-04T09:19:02.5657676Z { 2025-12-04T09:19:02.5657981Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5658364Z "size": 106, 2025-12-04T09:19:02.5658749Z "digest": "sha256:303e6747a62efecf5efa1f97d0e66b40a3b39da8d79a51f75b89f4c92ae7ec52" 2025-12-04T09:19:02.5659180Z }, 2025-12-04T09:19:02.5659355Z { 2025-12-04T09:19:02.5659659Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5660036Z "size": 424, 2025-12-04T09:19:02.5660420Z "digest": "sha256:3017cdf4838bcc9a33daebc07487f8ae1f6bd6e7ce8322c14f5480e8db9ef90e" 2025-12-04T09:19:02.5660860Z }, 
2025-12-04T09:19:02.5661031Z { 2025-12-04T09:19:02.5661334Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5661721Z "size": 19309374, 2025-12-04T09:19:02.5662111Z "digest": "sha256:6b6cd1c358e886dc6ed7fd46ac4bcc1a0a73b7b1301739ea1953478ee5d83f50" 2025-12-04T09:19:02.5662557Z }, 2025-12-04T09:19:02.5662733Z { 2025-12-04T09:19:02.5663124Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5663513Z "size": 108, 2025-12-04T09:19:02.5663890Z "digest": "sha256:b2dd045011241d1cf8889e2a7369d9fe4844dfe15529b520ccd6a59bd3c1532e" 2025-12-04T09:19:02.5664319Z }, 2025-12-04T09:19:02.5664490Z { 2025-12-04T09:19:02.5664794Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5665178Z "size": 827, 2025-12-04T09:19:02.5665547Z "digest": "sha256:55adc51fe5897031d4cf2f2b8fd162213f6e46a52848630c616606271b97952e" 2025-12-04T09:19:02.5665979Z }, 2025-12-04T09:19:02.5666157Z { 2025-12-04T09:19:02.5666479Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5666888Z "size": 724, 2025-12-04T09:19:02.5667256Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:19:02.5667671Z }, 2025-12-04T09:19:02.5667859Z { 2025-12-04T09:19:02.5668173Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5668562Z "size": 149, 2025-12-04T09:19:02.5668927Z "digest": "sha256:a43ca0e4b837964b12b7469194cfe939c26de027298040028975324dce25938a" 2025-12-04T09:19:02.5669351Z }, 2025-12-04T09:19:02.5669531Z { 2025-12-04T09:19:02.5669829Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5670210Z "size": 138, 2025-12-04T09:19:02.5670584Z "digest": "sha256:b7212f17fd1404837fcfdd086dd0e2667931e4db377d45d8d89a44390c84e11d" 2025-12-04T09:19:02.5671008Z }, 2025-12-04T09:19:02.5671182Z { 2025-12-04T09:19:02.5671487Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5671860Z "size": 141, 2025-12-04T09:19:02.5672234Z "digest": "sha256:083e42cac090e6486c35f392b64ee54448f5e4aa947003aeb3e1f92c8ea5c099" 2025-12-04T09:19:02.5672662Z }, 2025-12-04T09:19:02.5672833Z { 2025-12-04T09:19:02.5673137Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5673528Z "size": 32, 2025-12-04T09:19:02.5673910Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5674341Z }, 2025-12-04T09:19:02.5674514Z { 2025-12-04T09:19:02.5674819Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5675203Z "size": 223, 2025-12-04T09:19:02.5675580Z "digest": "sha256:0a00b784a4aac341795729b254f7edd09e811b7f51d0c58e0e6bfeeee6940503" 2025-12-04T09:19:02.5676015Z }, 2025-12-04T09:19:02.5676188Z { 2025-12-04T09:19:02.5676517Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5676923Z "size": 255, 2025-12-04T09:19:02.5677291Z "digest": "sha256:c6173c779f7ba143a21214ea5f032b141863a37ceb4c0ac01d3248c216ce5241" 2025-12-04T09:19:02.5677719Z }, 2025-12-04T09:19:02.5677898Z { 2025-12-04T09:19:02.5678196Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5678676Z "size": 145520672, 2025-12-04T09:19:02.5679074Z "digest": "sha256:ed3d1e3387b924585c332bf1bc252fa159cd0d25256a874043ff0141b1ab5ff7" 2025-12-04T09:19:02.5679581Z }, 2025-12-04T09:19:02.5679757Z { 2025-12-04T09:19:02.5680064Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5680451Z "size": 106, 2025-12-04T09:19:02.5680812Z "digest": "sha256:b29343478586aeee19d2a622661716f6f1591280c890f49b727a8da13a610784" 2025-12-04T09:19:02.5681234Z }, 2025-12-04T09:19:02.5681416Z { 2025-12-04T09:19:02.5681718Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5682099Z "size": 312293530, 
2025-12-04T09:19:02.5682491Z "digest": "sha256:c6f0520487fb506bc4601fd84d5f28d8a76b203e004731e4b2067c2ab1a14e0b" 2025-12-04T09:19:02.5682915Z }, 2025-12-04T09:19:02.5683095Z { 2025-12-04T09:19:02.5683403Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5683799Z "size": 3058011133, 2025-12-04T09:19:02.5684304Z "digest": "sha256:148171691cd4c4d20310d490d4b4dd903490d04ea07fb8f7e668a28768683e9a" 2025-12-04T09:19:02.5684735Z }, 2025-12-04T09:19:02.5684914Z { 2025-12-04T09:19:02.5685217Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5685601Z "size": 129, 2025-12-04T09:19:02.5685987Z "digest": "sha256:2c666d30ed77fff9ff1167d41cd645dad98280fcbe941f5bc3828c7ae66b1287" 2025-12-04T09:19:02.5686418Z }, 2025-12-04T09:19:02.5686594Z { 2025-12-04T09:19:02.5686897Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5687273Z "size": 880, 2025-12-04T09:19:02.5687652Z "digest": "sha256:5d8d3a0a98e012c5068e0f3bae5a03e3148ecf2d063634eee4c9241a1e3fdfb5" 2025-12-04T09:19:02.5688087Z }, 2025-12-04T09:19:02.5688262Z { 2025-12-04T09:19:02.5688567Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5688947Z "size": 724, 2025-12-04T09:19:02.5689331Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:19:02.5689748Z }, 2025-12-04T09:19:02.5689927Z { 2025-12-04T09:19:02.5690234Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5690615Z "size": 139, 2025-12-04T09:19:02.5690987Z "digest": "sha256:b06bafce9e817295d8127207747c80aa18e04392ff0875844fc30a1e794a8a0c" 2025-12-04T09:19:02.5691417Z }, 2025-12-04T09:19:02.5691591Z { 2025-12-04T09:19:02.5691898Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5692280Z "size": 32, 2025-12-04T09:19:02.5692660Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 
2025-12-04T09:19:02.5693098Z }, 2025-12-04T09:19:02.5693276Z { 2025-12-04T09:19:02.5693575Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5693957Z "size": 159, 2025-12-04T09:19:02.5694337Z "digest": "sha256:15e0d7e4590d3d8f598d05aec3a92f891bf8b4605bcc38cc2de852b6014ef8f3" 2025-12-04T09:19:02.5694782Z }, 2025-12-04T09:19:02.5694954Z { 2025-12-04T09:19:02.5695260Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5695644Z "size": 1011, 2025-12-04T09:19:02.5696021Z "digest": "sha256:a514bd1add3164d8d7ca99aa19294c4ed8b97b074635d98714c4f598a959f4cd" 2025-12-04T09:19:02.5696493Z }, 2025-12-04T09:19:02.5696685Z { 2025-12-04T09:19:02.5696987Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5697370Z "size": 724, 2025-12-04T09:19:02.5697740Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:19:02.5698159Z }, 2025-12-04T09:19:02.5698340Z { 2025-12-04T09:19:02.5698648Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5699030Z "size": 134, 2025-12-04T09:19:02.5699405Z "digest": "sha256:57b84ee6000204f27a1d9bca199b19be4c86ecd324540dbdf239c56a6c3b34ea" 2025-12-04T09:19:02.5699933Z }, 2025-12-04T09:19:02.5700114Z { 2025-12-04T09:19:02.5700912Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5701333Z "size": 32, 2025-12-04T09:19:02.5701731Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5702169Z }, 2025-12-04T09:19:02.5702352Z { 2025-12-04T09:19:02.5702667Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5703049Z "size": 157, 2025-12-04T09:19:02.5703446Z "digest": "sha256:b8babeff6d817a5961dddc15c6bdfdbd05da187fae75d5804015f99fd7c066d8" 2025-12-04T09:19:02.5703900Z }, 2025-12-04T09:19:02.5704077Z { 2025-12-04T09:19:02.5704385Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5704773Z "size": 602, 2025-12-04T09:19:02.5705157Z "digest": "sha256:83779ddf6a85ab387f64a45f274cba245b69e4fd1931ff0b5d7d3efd4b7a43bc" 2025-12-04T09:19:02.5705596Z }, 2025-12-04T09:19:02.5705785Z { 2025-12-04T09:19:02.5706283Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5706674Z "size": 724, 2025-12-04T09:19:02.5707046Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:19:02.5707473Z }, 2025-12-04T09:19:02.5707647Z { 2025-12-04T09:19:02.5707957Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5708351Z "size": 155, 2025-12-04T09:19:02.5708731Z "digest": "sha256:8b7620c0d736cc79381207ce5afe2af90f0cd7f0cd394577d2c9520d7f74762f" 2025-12-04T09:19:02.5709173Z }, 2025-12-04T09:19:02.5709356Z { 2025-12-04T09:19:02.5709661Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5710048Z "size": 32, 2025-12-04T09:19:02.5710434Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5710871Z }, 2025-12-04T09:19:02.5711047Z { 2025-12-04T09:19:02.5711363Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5711757Z "size": 188, 2025-12-04T09:19:02.5712139Z "digest": "sha256:3bcfa090e4efd3677425f76baea9f1e0c50a75d8c6b5713ec05310f1dff24539" 2025-12-04T09:19:02.5712587Z }, 2025-12-04T09:19:02.5712770Z { 2025-12-04T09:19:02.5713075Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5713462Z "size": 1370, 2025-12-04T09:19:02.5713850Z "digest": "sha256:eb0504ec4d9218a79896b604f73dc0ea5a0f96266ad9c2cdbbbe5f0f18222694" 2025-12-04T09:19:02.5714287Z }, 2025-12-04T09:19:02.5714476Z { 2025-12-04T09:19:02.5714788Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5715177Z "size": 32, 
2025-12-04T09:19:02.5715561Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5716000Z }, 2025-12-04T09:19:02.5716178Z { 2025-12-04T09:19:02.5716485Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5716881Z "size": 136, 2025-12-04T09:19:02.5717263Z "digest": "sha256:15d0fec09d7b196a1462d51516ee90fc3443ba178d3e56d59cacf32146b4321d" 2025-12-04T09:19:02.5717702Z }, 2025-12-04T09:19:02.5717889Z { 2025-12-04T09:19:02.5718195Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5718582Z "size": 528, 2025-12-04T09:19:02.5718973Z "digest": "sha256:cca81fcc62a949959ca4dd3c9056fb293d548ef8607127eeeef6cfd3a8897ca8" 2025-12-04T09:19:02.5719409Z }, 2025-12-04T09:19:02.5719651Z { 2025-12-04T09:19:02.5719961Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5720341Z "size": 32, 2025-12-04T09:19:02.5720726Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5721165Z }, 2025-12-04T09:19:02.5721351Z { 2025-12-04T09:19:02.5721659Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5722175Z "size": 104, 2025-12-04T09:19:02.5722576Z "digest": "sha256:b0b8f9b5c6ab98db9cd830dc584e1b6aec9add139e4cc48d8c243d36691e25b4" 2025-12-04T09:19:02.5723020Z }, 2025-12-04T09:19:02.5723204Z { 2025-12-04T09:19:02.5723515Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5723899Z "size": 435, 2025-12-04T09:19:02.5724281Z "digest": "sha256:0606ca4d47a8a70e91e92b03ca51a85e731641b09342136a54ef2f2a6d9dfb44" 2025-12-04T09:19:02.5724722Z }, 2025-12-04T09:19:02.5724902Z { 2025-12-04T09:19:02.5725212Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5725595Z "size": 32, 2025-12-04T09:19:02.5725972Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 
2025-12-04T09:19:02.5726418Z }, 2025-12-04T09:19:02.5726630Z { 2025-12-04T09:19:02.5726962Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5727342Z "size": 109, 2025-12-04T09:19:02.5727819Z "digest": "sha256:2f80a4e1b3b95ed67bb781ea787e8a63e46de79117d9d8e65c257072b38afa2d" 2025-12-04T09:19:02.5728264Z }, 2025-12-04T09:19:02.5728444Z { 2025-12-04T09:19:02.5728751Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5729134Z "size": 1896, 2025-12-04T09:19:02.5729512Z "digest": "sha256:35c916fb1bd057e517dcab78c3a2a018e68096d8993892ad84f47562d37ae352" 2025-12-04T09:19:02.5729938Z }, 2025-12-04T09:19:02.5730122Z { 2025-12-04T09:19:02.5730429Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5730819Z "size": 197526165, 2025-12-04T09:19:02.5731210Z "digest": "sha256:195537b7dafc96192f768323b1a8cc2a914d41959849b73198579576b0872a44" 2025-12-04T09:19:02.5731635Z }, 2025-12-04T09:19:02.5731811Z { 2025-12-04T09:19:02.5732118Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5732507Z "size": 106, 2025-12-04T09:19:02.5732880Z "digest": "sha256:dc454fd3967e5735b2498b7f1d958a2c626987d5e4ce225ca98da3cd945b59f3" 2025-12-04T09:19:02.5733324Z }, 2025-12-04T09:19:02.5733502Z { 2025-12-04T09:19:02.5733805Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5734195Z "size": 165, 2025-12-04T09:19:02.5734576Z "digest": "sha256:701b34f115fa897181c046dc37288e87cbc3ad74c36a9e2224b5bfe7c5703afb" 2025-12-04T09:19:02.5735007Z }, 2025-12-04T09:19:02.5735191Z { 2025-12-04T09:19:02.5735499Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5735888Z "size": 7944, 2025-12-04T09:19:02.5736278Z "digest": "sha256:39cefc00ffedebc9098261c798408b87a20c95a88fccb110594077f48dadf760" 2025-12-04T09:19:02.5736719Z }, 2025-12-04T09:19:02.5736898Z { 2025-12-04T09:19:02.5737208Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5737593Z "size": 8071, 2025-12-04T09:19:02.5737979Z "digest": "sha256:6ae51eb61a325b2c2995a5088c81aa20821b75be65b5aa722c7c40556b5d03ea" 2025-12-04T09:19:02.5738420Z }, 2025-12-04T09:19:02.5738617Z { 2025-12-04T09:19:02.5738929Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5739321Z "size": 304, 2025-12-04T09:19:02.5739710Z "digest": "sha256:1fd5341e66dfc0c1ae23af014641a92a6fd02640c528fe6d4dc55921ed659a26" 2025-12-04T09:19:02.5740154Z }, 2025-12-04T09:19:02.5748656Z { 2025-12-04T09:19:02.5748992Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5749388Z "size": 13364291, 2025-12-04T09:19:02.5749785Z "digest": "sha256:72a7c87e35e40ab796f90aee1b51add7902f0cdc44406d2505b6c6a1f55a8da6" 2025-12-04T09:19:02.5750214Z }, 2025-12-04T09:19:02.5750384Z { 2025-12-04T09:19:02.5750687Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5751065Z "size": 108, 2025-12-04T09:19:02.5751444Z "digest": "sha256:ec36862ac98ebaac52ee1a8b1d162d45bd0e3bf59ae7e19c8f80ad3960b4c600" 2025-12-04T09:19:02.5752004Z }, 2025-12-04T09:19:02.5752175Z { 2025-12-04T09:19:02.5752475Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5752855Z "size": 54145699, 2025-12-04T09:19:02.5753242Z "digest": "sha256:05ddbf246e8add0e293474dbf88bb028d5a295a25ac59e8648a18db644377773" 2025-12-04T09:19:02.5753672Z }, 2025-12-04T09:19:02.5753845Z { 2025-12-04T09:19:02.5754146Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:19:02.5754516Z "size": 32, 2025-12-04T09:19:02.5754889Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:19:02.5755319Z } 2025-12-04T09:19:02.5755490Z ] 2025-12-04T09:19:02.5755657Z } 2025-12-04T09:19:02.5755839Z + exit 0 2025-12-04T09:19:02.5781974Z ##[group]Run set -eux 2025-12-04T09:19:02.5782236Z set 
-eux 2025-12-04T09:19:02.5782628Z # It's ok if this steps fails, it would then be an anonymous user like what we used to have 2025-12-04T09:19:02.5783803Z aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token | jq --raw-output '.SecretString' | jq -r .docker_hub_readonly_token | docker login --username pytorchbot --password-stdin || true 2025-12-04T09:19:02.5793452Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:02.5793809Z env: 2025-12-04T09:19:02.5794016Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:02.5794260Z ##[endgroup] 2025-12-04T09:19:02.5829483Z + aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token 2025-12-04T09:19:02.5830606Z + jq --raw-output .SecretString 2025-12-04T09:19:02.5831646Z + jq -r .docker_hub_readonly_token 2025-12-04T09:19:02.5833213Z + docker login --username pytorchbot --password-stdin 2025-12-04T09:19:03.1556980Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:19:03.1557757Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:19:03.1558340Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:19:03.1558821Z 2025-12-04T09:19:03.1558967Z Login Succeeded 2025-12-04T09:19:03.1652929Z ##[group]Run tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T09:19:03.1653284Z tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T09:19:03.1653660Z echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}" 2025-12-04T09:19:03.1662750Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:03.1663101Z env: 2025-12-04T09:19:03.1663309Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:03.1664092Z ECR_DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.1664891Z ##[endgroup] 2025-12-04T09:19:03.1699365Z docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.1746276Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2025-12-04T09:19:03.1746713Z with: 2025-12-04T09:19:03.1747489Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.1748388Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:03.1748762Z env: 2025-12-04T09:19:03.1748967Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:03.1749221Z ##[endgroup] 2025-12-04T09:19:03.1764902Z ##[group]Run set -x 2025-12-04T09:19:03.1765163Z set -x 2025-12-04T09:19:03.1765382Z set +e 2025-12-04T09:19:03.1765596Z  2025-12-04T09:19:03.1765803Z login() { 2025-12-04T09:19:03.1778617Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T09:19:03.1779162Z } 2025-12-04T09:19:03.1779372Z  2025-12-04T09:19:03.1779612Z retry () { 2025-12-04T09:19:03.1779874Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 
2025-12-04T09:19:03.1780377Z } 2025-12-04T09:19:03.1780582Z  2025-12-04T09:19:03.1780813Z retry login "${DOCKER_REGISTRY}" 2025-12-04T09:19:03.1781109Z  2025-12-04T09:19:03.1781563Z IMAGE_SIZE=$(docker manifest inspect "${DOCKER_IMAGE}" | jq '[.layers[].size, .config.size] | add / 1024 / 1024') 2025-12-04T09:19:03.1782181Z echo "Compressed size of image in MB: ${IMAGE_SIZE}" 2025-12-04T09:19:03.1782530Z  2025-12-04T09:19:03.1782738Z set -e 2025-12-04T09:19:03.1783059Z # ignore output since only exit code is used for conditional 2025-12-04T09:19:03.1783527Z # only pull docker image if it's not available locally 2025-12-04T09:19:03.1784024Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2025-12-04T09:19:03.1784507Z  retry docker pull "${DOCKER_IMAGE}" 2025-12-04T09:19:03.1784810Z fi 2025-12-04T09:19:03.1793839Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:19:03.1794202Z env: 2025-12-04T09:19:03.1794412Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:19:03.1795177Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.1796052Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:03.1796416Z ##[endgroup] 2025-12-04T09:19:03.1827472Z + set +e 2025-12-04T09:19:03.1827927Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:03.1828367Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:03.1832148Z + aws ecr get-login-password --region us-east-1 2025-12-04T09:19:03.1832980Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:19:03.7000856Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:19:03.7001433Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:19:03.7001980Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:19:03.7002341Z 2025-12-04T09:19:03.7002908Z Login Succeeded 2025-12-04T09:19:03.7036595Z ++ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.7037570Z ++ jq '[.layers[].size, .config.size] | add / 1024 / 1024' 2025-12-04T09:19:03.9179964Z + IMAGE_SIZE=15091.581844329834 2025-12-04T09:19:03.9180453Z + echo 'Compressed size of image in MB: 15091.581844329834' 2025-12-04T09:19:03.9180806Z + set -e 2025-12-04T09:19:03.9181597Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.9182770Z Compressed size of image in MB: 15091.581844329834 2025-12-04T09:19:03.9332812Z + retry docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:03.9334172Z + docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:19:04.1470919Z pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a: Pulling from pytorch/ci-image 2025-12-04T09:19:04.1471632Z 63e5bc7682b8: Pulling fs layer 2025-12-04T09:19:04.1471901Z 0678d56345c9: Pulling fs layer 2025-12-04T09:19:04.1472176Z 45f5c9ddfce7: Pulling fs layer 2025-12-04T09:19:04.1472444Z 086b1df51ac1: Pulling fs layer 2025-12-04T09:19:04.1472797Z fe8a7b64bf98: Pulling fs layer 2025-12-04T09:19:04.1473182Z 7680723e9a57: Pulling fs layer 2025-12-04T09:19:04.1473585Z 9c5027aeeb4e: Pulling fs layer 2025-12-04T09:19:04.1473971Z 9a5652110360: Pulling fs layer 2025-12-04T09:19:04.1474357Z 
375c4427e914: Pulling fs layer 2025-12-04T09:19:04.1474979Z a86faaa7dbdd: Pulling fs layer 2025-12-04T09:19:04.1475336Z fb7848686804: Pulling fs layer 2025-12-04T09:19:04.1475641Z 3541df015cdb: Pulling fs layer 2025-12-04T09:19:04.1475907Z 79dc80f426b2: Pulling fs layer 2025-12-04T09:19:04.1476179Z a13fcc1b90bb: Pulling fs layer 2025-12-04T09:19:04.1476439Z 4f4fb700ef54: Pulling fs layer 2025-12-04T09:19:04.1476701Z 549db4d6c618: Pulling fs layer 2025-12-04T09:19:04.1476966Z 5c63528cb580: Pulling fs layer 2025-12-04T09:19:04.1477222Z 75bd83b989a4: Pulling fs layer 2025-12-04T09:19:04.1477488Z de6e78970f51: Pulling fs layer 2025-12-04T09:19:04.1477751Z e13ed7c7e473: Pulling fs layer 2025-12-04T09:19:04.1478028Z 6e2949bcb741: Pulling fs layer 2025-12-04T09:19:04.1478297Z 14d69d9aaec7: Pulling fs layer 2025-12-04T09:19:04.1478556Z 5c02769dd8e5: Pulling fs layer 2025-12-04T09:19:04.1478827Z 35041ce524ac: Pulling fs layer 2025-12-04T09:19:04.1479093Z 2fa92dc5885e: Pulling fs layer 2025-12-04T09:19:04.1479368Z 2b85eafbd92a: Pulling fs layer 2025-12-04T09:19:04.1479780Z ff755a4ddad7: Pulling fs layer 2025-12-04T09:19:04.1480155Z 09eb41bdf42d: Pulling fs layer 2025-12-04T09:19:04.1480551Z 11ede4d59e93: Pulling fs layer 2025-12-04T09:19:04.1480944Z 1283cd8f801a: Pulling fs layer 2025-12-04T09:19:04.1481338Z 024fa855425f: Pulling fs layer 2025-12-04T09:19:04.1481729Z 303e6747a62e: Pulling fs layer 2025-12-04T09:19:04.1482107Z 3017cdf4838b: Pulling fs layer 2025-12-04T09:19:04.1482479Z 6b6cd1c358e8: Pulling fs layer 2025-12-04T09:19:04.1482751Z b2dd04501124: Pulling fs layer 2025-12-04T09:19:04.1483014Z 55adc51fe589: Pulling fs layer 2025-12-04T09:19:04.1483270Z a43ca0e4b837: Pulling fs layer 2025-12-04T09:19:04.1483548Z b7212f17fd14: Pulling fs layer 2025-12-04T09:19:04.1483918Z 083e42cac090: Pulling fs layer 2025-12-04T09:19:04.1484211Z 0a00b784a4aa: Pulling fs layer 2025-12-04T09:19:04.1484563Z c6173c779f7b: Pulling fs layer 2025-12-04T09:19:04.1484875Z ed3d1e3387b9: 
Pulling fs layer 2025-12-04T09:19:04.1485239Z b29343478586: Pulling fs layer 2025-12-04T09:19:04.1485559Z c6f0520487fb: Pulling fs layer 2025-12-04T09:19:04.1485864Z 148171691cd4: Pulling fs layer 2025-12-04T09:19:04.1486123Z 2c666d30ed77: Pulling fs layer 2025-12-04T09:19:04.1486378Z 5d8d3a0a98e0: Pulling fs layer 2025-12-04T09:19:04.1486644Z b06bafce9e81: Pulling fs layer 2025-12-04T09:19:04.1486909Z 15e0d7e4590d: Pulling fs layer 2025-12-04T09:19:04.1487172Z a514bd1add31: Pulling fs layer 2025-12-04T09:19:04.1487428Z 57b84ee60002: Pulling fs layer 2025-12-04T09:19:04.1487780Z b8babeff6d81: Pulling fs layer 2025-12-04T09:19:04.1488062Z 83779ddf6a85: Pulling fs layer 2025-12-04T09:19:04.1488356Z 8b7620c0d736: Pulling fs layer 2025-12-04T09:19:04.1488702Z 3bcfa090e4ef: Pulling fs layer 2025-12-04T09:19:04.1488978Z eb0504ec4d92: Pulling fs layer 2025-12-04T09:19:04.1489236Z 15d0fec09d7b: Pulling fs layer 2025-12-04T09:19:04.1489500Z cca81fcc62a9: Pulling fs layer 2025-12-04T09:19:04.1489993Z b0b8f9b5c6ab: Pulling fs layer 2025-12-04T09:19:04.1490265Z 0606ca4d47a8: Pulling fs layer 2025-12-04T09:19:04.1490537Z 2f80a4e1b3b9: Pulling fs layer 2025-12-04T09:19:04.1490803Z 35c916fb1bd0: Pulling fs layer 2025-12-04T09:19:04.1491064Z 195537b7dafc: Pulling fs layer 2025-12-04T09:19:04.1491326Z dc454fd3967e: Pulling fs layer 2025-12-04T09:19:04.1491593Z 701b34f115fa: Pulling fs layer 2025-12-04T09:19:04.1491849Z 39cefc00ffed: Pulling fs layer 2025-12-04T09:19:04.1492115Z 6ae51eb61a32: Pulling fs layer 2025-12-04T09:19:04.1492390Z 1fd5341e66df: Pulling fs layer 2025-12-04T09:19:04.1492652Z 72a7c87e35e4: Pulling fs layer 2025-12-04T09:19:04.1492904Z ec36862ac98e: Pulling fs layer 2025-12-04T09:19:04.1493176Z 05ddbf246e8a: Pulling fs layer 2025-12-04T09:19:04.1493427Z 4f4fb700ef54: Waiting 2025-12-04T09:19:04.1493645Z fb7848686804: Waiting 2025-12-04T09:19:04.1493866Z 086b1df51ac1: Waiting 2025-12-04T09:19:04.1494096Z b29343478586: Waiting 2025-12-04T09:19:04.1494317Z 
c6f0520487fb: Waiting 2025-12-04T09:19:04.1494541Z 6b6cd1c358e8: Waiting 2025-12-04T09:19:04.1494761Z 148171691cd4: Waiting 2025-12-04T09:19:04.1495103Z 79dc80f426b2: Waiting 2025-12-04T09:19:04.1495335Z 2c666d30ed77: Waiting 2025-12-04T09:19:04.1495570Z 9c5027aeeb4e: Waiting 2025-12-04T09:19:04.1495798Z 15e0d7e4590d: Waiting 2025-12-04T09:19:04.1496031Z 5d8d3a0a98e0: Waiting 2025-12-04T09:19:04.1496268Z a514bd1add31: Waiting 2025-12-04T09:19:04.1496502Z 57b84ee60002: Waiting 2025-12-04T09:19:04.1496732Z 3541df015cdb: Waiting 2025-12-04T09:19:04.1496996Z b06bafce9e81: Waiting 2025-12-04T09:19:04.1497230Z 549db4d6c618: Waiting 2025-12-04T09:19:04.1497455Z 83779ddf6a85: Waiting 2025-12-04T09:19:04.1497674Z 5c63528cb580: Waiting 2025-12-04T09:19:04.1497891Z b8babeff6d81: Waiting 2025-12-04T09:19:04.1498117Z 375c4427e914: Waiting 2025-12-04T09:19:04.1498338Z a13fcc1b90bb: Waiting 2025-12-04T09:19:04.1498559Z 9a5652110360: Waiting 2025-12-04T09:19:04.1498788Z b2dd04501124: Waiting 2025-12-04T09:19:04.1499014Z 55adc51fe589: Waiting 2025-12-04T09:19:04.1499246Z a43ca0e4b837: Waiting 2025-12-04T09:19:04.1499475Z 083e42cac090: Waiting 2025-12-04T09:19:04.1499701Z b7212f17fd14: Waiting 2025-12-04T09:19:04.1499953Z de6e78970f51: Waiting 2025-12-04T09:19:04.1500167Z 6e2949bcb741: Waiting 2025-12-04T09:19:04.1500688Z 3017cdf4838b: Waiting 2025-12-04T09:19:04.1500911Z 303e6747a62e: Waiting 2025-12-04T09:19:04.1501125Z c6173c779f7b: Waiting 2025-12-04T09:19:04.1501343Z 7680723e9a57: Waiting 2025-12-04T09:19:04.1501560Z 35041ce524ac: Waiting 2025-12-04T09:19:04.1501776Z 2b85eafbd92a: Waiting 2025-12-04T09:19:04.1502005Z e13ed7c7e473: Waiting 2025-12-04T09:19:04.1502228Z 09eb41bdf42d: Waiting 2025-12-04T09:19:04.1502445Z ed3d1e3387b9: Waiting 2025-12-04T09:19:04.1502668Z ff755a4ddad7: Waiting 2025-12-04T09:19:04.1502891Z fe8a7b64bf98: Waiting 2025-12-04T09:19:04.1503111Z 195537b7dafc: Waiting 2025-12-04T09:19:04.1503332Z 11ede4d59e93: Waiting 2025-12-04T09:19:04.1503553Z 
1fd5341e66df: Waiting 2025-12-04T09:19:04.1503767Z 2fa92dc5885e: Waiting 2025-12-04T09:19:04.1504005Z 15d0fec09d7b: Waiting 2025-12-04T09:19:04.1504234Z 701b34f115fa: Waiting 2025-12-04T09:19:04.1504453Z 5c02769dd8e5: Waiting 2025-12-04T09:19:04.1504680Z dc454fd3967e: Waiting 2025-12-04T09:19:04.1504904Z 39cefc00ffed: Waiting 2025-12-04T09:19:04.1505132Z ec36862ac98e: Waiting 2025-12-04T09:19:04.1505352Z 3bcfa090e4ef: Waiting 2025-12-04T09:19:04.1505577Z 6ae51eb61a32: Waiting 2025-12-04T09:19:04.1505800Z 05ddbf246e8a: Waiting 2025-12-04T09:19:04.1506030Z 2f80a4e1b3b9: Waiting 2025-12-04T09:19:04.1506360Z a86faaa7dbdd: Waiting 2025-12-04T09:19:04.1506603Z cca81fcc62a9: Waiting 2025-12-04T09:19:04.1506828Z 024fa855425f: Waiting 2025-12-04T09:19:04.1507196Z 0606ca4d47a8: Waiting 2025-12-04T09:19:04.1507460Z 35c916fb1bd0: Waiting 2025-12-04T09:19:04.1507745Z b0b8f9b5c6ab: Waiting 2025-12-04T09:19:04.1507967Z 8b7620c0d736: Waiting 2025-12-04T09:19:04.1508190Z 72a7c87e35e4: Waiting 2025-12-04T09:19:04.1508408Z 1283cd8f801a: Waiting 2025-12-04T09:19:04.1508630Z eb0504ec4d92: Waiting 2025-12-04T09:19:04.1509022Z 0a00b784a4aa: Waiting 2025-12-04T09:19:04.1509250Z 75bd83b989a4: Waiting 2025-12-04T09:19:04.1509479Z 14d69d9aaec7: Waiting 2025-12-04T09:19:04.2410549Z 0678d56345c9: Verifying Checksum 2025-12-04T09:19:04.2410885Z 0678d56345c9: Download complete 2025-12-04T09:19:04.3193521Z 086b1df51ac1: Download complete 2025-12-04T09:19:04.4062267Z fe8a7b64bf98: Verifying Checksum 2025-12-04T09:19:04.4062593Z fe8a7b64bf98: Download complete 2025-12-04T09:19:04.4839312Z 7680723e9a57: Download complete 2025-12-04T09:19:04.5052665Z 63e5bc7682b8: Verifying Checksum 2025-12-04T09:19:04.5053136Z 63e5bc7682b8: Download complete 2025-12-04T09:19:04.5776411Z 9a5652110360: Verifying Checksum 2025-12-04T09:19:04.5776825Z 9a5652110360: Download complete 2025-12-04T09:19:04.5777562Z 9c5027aeeb4e: Verifying Checksum 2025-12-04T09:19:04.5777854Z 9c5027aeeb4e: Download complete 
2025-12-04T09:19:04.6813145Z a86faaa7dbdd: Download complete 2025-12-04T09:19:04.7680770Z fb7848686804: Verifying Checksum 2025-12-04T09:19:04.7681220Z fb7848686804: Download complete 2025-12-04T09:19:04.8391364Z 3541df015cdb: Download complete 2025-12-04T09:19:04.9408510Z 79dc80f426b2: Download complete 2025-12-04T09:19:05.7146726Z 63e5bc7682b8: Pull complete 2025-12-04T09:19:05.7382331Z 375c4427e914: Verifying Checksum 2025-12-04T09:19:05.7382733Z 375c4427e914: Download complete 2025-12-04T09:19:05.7383014Z 0678d56345c9: Pull complete 2025-12-04T09:19:05.7481895Z 4f4fb700ef54: Verifying Checksum 2025-12-04T09:19:05.7482293Z 4f4fb700ef54: Download complete 2025-12-04T09:19:05.8057996Z 549db4d6c618: Verifying Checksum 2025-12-04T09:19:05.8058308Z 549db4d6c618: Download complete 2025-12-04T09:19:05.8853989Z 5c63528cb580: Verifying Checksum 2025-12-04T09:19:05.8854287Z 5c63528cb580: Download complete 2025-12-04T09:19:05.9768316Z 75bd83b989a4: Verifying Checksum 2025-12-04T09:19:05.9768622Z 75bd83b989a4: Download complete 2025-12-04T09:19:06.0630987Z de6e78970f51: Verifying Checksum 2025-12-04T09:19:06.0631303Z de6e78970f51: Download complete 2025-12-04T09:19:06.2041758Z e13ed7c7e473: Verifying Checksum 2025-12-04T09:19:06.2042191Z e13ed7c7e473: Download complete 2025-12-04T09:19:06.2868401Z 6e2949bcb741: Verifying Checksum 2025-12-04T09:19:06.2868709Z 6e2949bcb741: Download complete 2025-12-04T09:19:06.3799537Z 14d69d9aaec7: Verifying Checksum 2025-12-04T09:19:06.3799842Z 14d69d9aaec7: Download complete 2025-12-04T09:19:06.4520991Z 5c02769dd8e5: Download complete 2025-12-04T09:19:07.3306111Z 45f5c9ddfce7: Verifying Checksum 2025-12-04T09:19:07.3306462Z 45f5c9ddfce7: Download complete 2025-12-04T09:19:07.4249975Z 2fa92dc5885e: Verifying Checksum 2025-12-04T09:19:07.4250371Z 2fa92dc5885e: Download complete 2025-12-04T09:19:07.8219569Z 2b85eafbd92a: Verifying Checksum 2025-12-04T09:19:07.8219938Z 2b85eafbd92a: Download complete 2025-12-04T09:19:07.9147097Z ff755a4ddad7: 
Verifying Checksum 2025-12-04T09:19:07.9147431Z ff755a4ddad7: Download complete 2025-12-04T09:19:08.0083212Z 09eb41bdf42d: Download complete 2025-12-04T09:19:12.6650432Z 11ede4d59e93: Verifying Checksum 2025-12-04T09:19:12.6650792Z 11ede4d59e93: Download complete 2025-12-04T09:19:12.7741175Z 1283cd8f801a: Download complete 2025-12-04T09:19:12.8497631Z 024fa855425f: Verifying Checksum 2025-12-04T09:19:12.8498069Z 024fa855425f: Download complete 2025-12-04T09:19:12.9248020Z 303e6747a62e: Download complete 2025-12-04T09:19:13.0107235Z 3017cdf4838b: Download complete 2025-12-04T09:19:13.2494557Z 6b6cd1c358e8: Verifying Checksum 2025-12-04T09:19:13.2495023Z 6b6cd1c358e8: Download complete 2025-12-04T09:19:13.3412632Z b2dd04501124: Verifying Checksum 2025-12-04T09:19:13.3413072Z b2dd04501124: Download complete 2025-12-04T09:19:13.4167551Z 55adc51fe589: Verifying Checksum 2025-12-04T09:19:13.4167961Z 55adc51fe589: Download complete 2025-12-04T09:19:13.4954267Z a43ca0e4b837: Verifying Checksum 2025-12-04T09:19:13.4954609Z a43ca0e4b837: Download complete 2025-12-04T09:19:13.5726209Z b7212f17fd14: Verifying Checksum 2025-12-04T09:19:13.5726554Z b7212f17fd14: Download complete 2025-12-04T09:19:13.7130131Z 083e42cac090: Download complete 2025-12-04T09:19:13.8170566Z 0a00b784a4aa: Verifying Checksum 2025-12-04T09:19:13.8171110Z 0a00b784a4aa: Download complete 2025-12-04T09:19:13.8989804Z c6173c779f7b: Verifying Checksum 2025-12-04T09:19:13.8990168Z c6173c779f7b: Download complete 2025-12-04T09:19:15.4061942Z ed3d1e3387b9: Verifying Checksum 2025-12-04T09:19:15.4062419Z ed3d1e3387b9: Download complete 2025-12-04T09:19:15.4998617Z b29343478586: Verifying Checksum 2025-12-04T09:19:15.4999047Z b29343478586: Download complete 2025-12-04T09:19:17.6019775Z 45f5c9ddfce7: Pull complete 2025-12-04T09:19:17.7924549Z 086b1df51ac1: Pull complete 2025-12-04T09:19:17.9448113Z fe8a7b64bf98: Pull complete 2025-12-04T09:19:18.1053516Z 7680723e9a57: Pull complete 2025-12-04T09:19:18.3277464Z 
9c5027aeeb4e: Pull complete 2025-12-04T09:19:18.4809403Z 9a5652110360: Pull complete 2025-12-04T09:19:18.6752852Z c6f0520487fb: Verifying Checksum 2025-12-04T09:19:18.6753348Z c6f0520487fb: Download complete 2025-12-04T09:19:21.1740936Z 375c4427e914: Pull complete 2025-12-04T09:19:21.3981789Z a86faaa7dbdd: Pull complete 2025-12-04T09:19:21.6173671Z fb7848686804: Pull complete 2025-12-04T09:19:21.8550427Z 3541df015cdb: Pull complete 2025-12-04T09:19:22.0682251Z 79dc80f426b2: Pull complete 2025-12-04T09:19:36.8510463Z a13fcc1b90bb: Verifying Checksum 2025-12-04T09:19:36.8510830Z a13fcc1b90bb: Download complete 2025-12-04T09:19:36.9424301Z 2c666d30ed77: Download complete 2025-12-04T09:19:37.0256161Z 5d8d3a0a98e0: Verifying Checksum 2025-12-04T09:19:37.0256527Z 5d8d3a0a98e0: Download complete 2025-12-04T09:19:37.1172839Z b06bafce9e81: Download complete 2025-12-04T09:19:37.1981851Z 15e0d7e4590d: Verifying Checksum 2025-12-04T09:19:37.1982324Z 15e0d7e4590d: Download complete 2025-12-04T09:19:37.2926317Z a514bd1add31: Download complete 2025-12-04T09:19:37.3810115Z 57b84ee60002: Download complete 2025-12-04T09:19:37.4763291Z b8babeff6d81: Verifying Checksum 2025-12-04T09:19:37.4763638Z b8babeff6d81: Download complete 2025-12-04T09:19:37.5704974Z 83779ddf6a85: Download complete 2025-12-04T09:19:37.6506833Z 8b7620c0d736: Verifying Checksum 2025-12-04T09:19:37.6507217Z 8b7620c0d736: Download complete 2025-12-04T09:19:37.7489637Z 3bcfa090e4ef: Download complete 2025-12-04T09:19:37.8281001Z eb0504ec4d92: Verifying Checksum 2025-12-04T09:19:37.8281382Z eb0504ec4d92: Download complete 2025-12-04T09:19:37.9166985Z 15d0fec09d7b: Download complete 2025-12-04T09:19:37.9904694Z cca81fcc62a9: Download complete 2025-12-04T09:19:38.0939586Z b0b8f9b5c6ab: Download complete 2025-12-04T09:19:38.2048384Z 0606ca4d47a8: Verifying Checksum 2025-12-04T09:19:38.2048709Z 0606ca4d47a8: Download complete 2025-12-04T09:19:38.2778096Z 2f80a4e1b3b9: Download complete 2025-12-04T09:19:38.3649184Z 
35c916fb1bd0: Verifying Checksum 2025-12-04T09:19:38.3649561Z 35c916fb1bd0: Download complete 2025-12-04T09:19:40.4060369Z 195537b7dafc: Verifying Checksum 2025-12-04T09:19:40.4060734Z 195537b7dafc: Download complete 2025-12-04T09:19:40.4820145Z dc454fd3967e: Verifying Checksum 2025-12-04T09:19:40.4822002Z dc454fd3967e: Download complete 2025-12-04T09:19:40.5730759Z 701b34f115fa: Verifying Checksum 2025-12-04T09:19:40.5731264Z 701b34f115fa: Download complete 2025-12-04T09:19:40.6521058Z 39cefc00ffed: Verifying Checksum 2025-12-04T09:19:40.6521508Z 39cefc00ffed: Download complete 2025-12-04T09:19:40.7241947Z 6ae51eb61a32: Verifying Checksum 2025-12-04T09:19:40.7242468Z 6ae51eb61a32: Download complete 2025-12-04T09:19:40.8184284Z 1fd5341e66df: Download complete 2025-12-04T09:19:41.0084463Z 72a7c87e35e4: Verifying Checksum 2025-12-04T09:19:41.0084824Z 72a7c87e35e4: Download complete 2025-12-04T09:19:41.0981463Z ec36862ac98e: Verifying Checksum 2025-12-04T09:19:41.0981945Z ec36862ac98e: Download complete 2025-12-04T09:19:41.7058513Z 05ddbf246e8a: Verifying Checksum 2025-12-04T09:19:41.7058900Z 05ddbf246e8a: Download complete 2025-12-04T09:19:49.3051641Z 148171691cd4: Verifying Checksum 2025-12-04T09:19:49.3051999Z 148171691cd4: Download complete 2025-12-04T09:20:25.9333180Z 35041ce524ac: Verifying Checksum 2025-12-04T09:20:25.9333523Z 35041ce524ac: Download complete 2025-12-04T09:20:58.9407478Z a13fcc1b90bb: Pull complete 2025-12-04T09:20:59.1249082Z 4f4fb700ef54: Pull complete 2025-12-04T09:20:59.2359369Z 549db4d6c618: Pull complete 2025-12-04T09:20:59.3947795Z 5c63528cb580: Pull complete 2025-12-04T09:20:59.6160738Z 75bd83b989a4: Pull complete 2025-12-04T09:20:59.8804698Z de6e78970f51: Pull complete 2025-12-04T09:21:00.0081051Z e13ed7c7e473: Pull complete 2025-12-04T09:21:00.1839104Z 6e2949bcb741: Pull complete 2025-12-04T09:21:00.2942596Z 14d69d9aaec7: Pull complete 2025-12-04T09:21:00.4549254Z 5c02769dd8e5: Pull complete 2025-12-04T09:22:31.8854926Z 35041ce524ac: 
Pull complete 2025-12-04T09:22:32.0861154Z 2fa92dc5885e: Pull complete 2025-12-04T09:22:32.7924846Z 2b85eafbd92a: Pull complete 2025-12-04T09:22:33.0156352Z ff755a4ddad7: Pull complete 2025-12-04T09:22:33.2138091Z 09eb41bdf42d: Pull complete 2025-12-04T09:22:41.7335822Z 11ede4d59e93: Pull complete 2025-12-04T09:22:41.9536360Z 1283cd8f801a: Pull complete 2025-12-04T09:22:42.1808939Z 024fa855425f: Pull complete 2025-12-04T09:22:42.6153778Z 303e6747a62e: Pull complete 2025-12-04T09:22:42.8416463Z 3017cdf4838b: Pull complete 2025-12-04T09:22:43.2797390Z 6b6cd1c358e8: Pull complete 2025-12-04T09:22:43.4918520Z b2dd04501124: Pull complete 2025-12-04T09:22:43.7080664Z 55adc51fe589: Pull complete 2025-12-04T09:22:44.1431541Z a43ca0e4b837: Pull complete 2025-12-04T09:22:44.3574306Z b7212f17fd14: Pull complete 2025-12-04T09:22:44.5825720Z 083e42cac090: Pull complete 2025-12-04T09:22:45.0196897Z 0a00b784a4aa: Pull complete 2025-12-04T09:22:45.2470278Z c6173c779f7b: Pull complete 2025-12-04T09:22:49.0366411Z ed3d1e3387b9: Pull complete 2025-12-04T09:22:49.2688715Z b29343478586: Pull complete 2025-12-04T09:22:50.7405924Z c6f0520487fb: Pull complete 2025-12-04T09:23:52.0616048Z 148171691cd4: Pull complete 2025-12-04T09:23:52.1937294Z 2c666d30ed77: Pull complete 2025-12-04T09:23:52.3817847Z 5d8d3a0a98e0: Pull complete 2025-12-04T09:23:52.8136831Z b06bafce9e81: Pull complete 2025-12-04T09:23:53.0284717Z 15e0d7e4590d: Pull complete 2025-12-04T09:23:53.1873019Z a514bd1add31: Pull complete 2025-12-04T09:23:53.4086356Z 57b84ee60002: Pull complete 2025-12-04T09:23:53.8198949Z b8babeff6d81: Pull complete 2025-12-04T09:23:53.9358806Z 83779ddf6a85: Pull complete 2025-12-04T09:23:54.2725778Z 8b7620c0d736: Pull complete 2025-12-04T09:23:54.4802331Z 3bcfa090e4ef: Pull complete 2025-12-04T09:23:54.6026671Z eb0504ec4d92: Pull complete 2025-12-04T09:23:54.8827088Z 15d0fec09d7b: Pull complete 2025-12-04T09:23:55.0967391Z cca81fcc62a9: Pull complete 2025-12-04T09:23:55.2726301Z b0b8f9b5c6ab: Pull 
complete 2025-12-04T09:23:55.3093646Z 0606ca4d47a8: Pull complete 2025-12-04T09:23:55.3757922Z 2f80a4e1b3b9: Pull complete 2025-12-04T09:23:55.4142634Z 35c916fb1bd0: Pull complete 2025-12-04T09:24:02.2750851Z 195537b7dafc: Pull complete 2025-12-04T09:24:02.4802154Z dc454fd3967e: Pull complete 2025-12-04T09:24:02.6766922Z 701b34f115fa: Pull complete 2025-12-04T09:24:02.8837000Z 39cefc00ffed: Pull complete 2025-12-04T09:24:03.0896160Z 6ae51eb61a32: Pull complete 2025-12-04T09:24:03.3200129Z 1fd5341e66df: Pull complete 2025-12-04T09:24:05.2124624Z 72a7c87e35e4: Pull complete 2025-12-04T09:24:05.4268157Z ec36862ac98e: Pull complete 2025-12-04T09:24:07.2144634Z 05ddbf246e8a: Pull complete 2025-12-04T09:24:07.4890186Z Digest: sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T09:24:07.5423905Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:24:07.5485909Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:24:07.5548503Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5549436Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5561637Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:24:07.5561997Z env: 2025-12-04T09:24:07.5562200Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:07.5562453Z ##[endgroup] 2025-12-04T09:24:07.5837370Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2025-12-04T09:24:07.5837767Z with: 2025-12-04T09:24:07.5837990Z driver-version: 580.82.07 2025-12-04T09:24:07.5838232Z env: 2025-12-04T09:24:07.5838440Z 
GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:07.5838684Z ##[endgroup] 2025-12-04T09:24:07.5862420Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5863269Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5874570Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:24:07.5874955Z env: 2025-12-04T09:24:07.5875374Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:07.5875623Z ##[endgroup] 2025-12-04T09:24:07.5955532Z ##[group]Run set -euo pipefail 2025-12-04T09:24:07.5955849Z set -euo pipefail 2025-12-04T09:24:07.5956140Z  2025-12-04T09:24:07.5956350Z has_gpu=false 2025-12-04T09:24:07.5956596Z devices="" 2025-12-04T09:24:07.5956823Z  2025-12-04T09:24:07.5957088Z if command -v nvidia-smi >/dev/null 2>&1; then 2025-12-04T09:24:07.5957515Z  if nvidia-smi -L >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:24:07.5957889Z  has_gpu=true 2025-12-04T09:24:07.5958177Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:24:07.5958480Z  fi 2025-12-04T09:24:07.5958692Z fi 2025-12-04T09:24:07.5958895Z  2025-12-04T09:24:07.5959112Z if [ "$has_gpu" = false ]; then 2025-12-04T09:24:07.5959583Z  if ls /dev/nvidia* >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:24:07.5959969Z  has_gpu=true 2025-12-04T09:24:07.5960250Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:24:07.5960542Z  fi 2025-12-04T09:24:07.5960753Z fi 2025-12-04T09:24:07.5960960Z  2025-12-04T09:24:07.5961251Z if [ "$has_gpu" = false ] && command -v lspci >/dev/null 2>&1; then 2025-12-04T09:24:07.5961742Z  if lspci | grep -i 'nvidia' >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:24:07.5962139Z  has_gpu=true 2025-12-04T09:24:07.5962420Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:24:07.5962722Z  fi 2025-12-04T09:24:07.5962931Z fi 2025-12-04T09:24:07.5963124Z  
2025-12-04T09:24:07.5963416Z printf 'HAS_NVIDIA=%s\n' "$has_gpu" >> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5963918Z printf 'DETECTED_DEVICES<> "$GITHUB_OUTPUT" 2025-12-04T09:24:07.5972512Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:24:07.5972872Z env: 2025-12-04T09:24:07.5973089Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:07.5973350Z ##[endgroup] 2025-12-04T09:24:09.3365805Z ##[group]Run if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:24:09.3366197Z if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:24:09.3366559Z  echo "HAS_NVIDIA_GPU=true" >> "${GITHUB_ENV}" 2025-12-04T09:24:09.3367053Z  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" 2025-12-04T09:24:09.3367486Z else 2025-12-04T09:24:09.3367755Z  echo "HAS_NVIDIA_GPU=false" >> "${GITHUB_ENV}" 2025-12-04T09:24:09.3368081Z fi 2025-12-04T09:24:09.3378792Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:24:09.3379138Z env: 2025-12-04T09:24:09.3379346Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:09.3379595Z HAS_NVIDIA: true 2025-12-04T09:24:09.3379812Z ##[endgroup] 2025-12-04T09:24:09.3496827Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2025-12-04T09:24:09.3497242Z with: 2025-12-04T09:24:09.3497457Z timeout_minutes: 10 2025-12-04T09:24:09.3497698Z max_attempts: 3 2025-12-04T09:24:09.3522870Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. 
/etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y \ nvidia-container-toolkit-1.17.8 \ libnvidia-container-tools-1.17.8 \ libnvidia-container1-1.17.8 \ nvidia-container-toolkit-base-1.17.8 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-container-toolkit-1.17.8 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. Also check only the first GPU if there are more than one of them # so that the same driver version is not print over multiple lines INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). 
Continuing" elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing" # Turn off persistent mode so that the installation script can unload the kernel module sudo killall nvidia-persistenced || true else HAS_NVIDIA_DRIVER=1 echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation" fi set -e fi if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then # CAUTION: this may need to be updated in future if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then sudo yum groupinstall -y "Development Tools" # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" sudo modprobe backlight fi sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" set +e sudo /bin/bash /tmp/nvidia_driver -s --no-drm NVIDIA_INSTALLATION_STATUS=$? RESET_GPU=0 if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then sudo cat /var/log/nvidia-installer.log # Fail to install NVIDIA driver, try to reset the GPU RESET_GPU=1 elif [ -x "$(command -v nvidia-smi)" ]; then # Check again if nvidia-smi works even if the driver installation completes successfully INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. 
When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! 
| # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with # more than one GPUs. 
This just needs to be run once. The command fails # on subsequent runs and complains that the mode is already on, but that's # ok sudo nvidia-persistenced || true # This should show persistence mode ON nvidia-smi # check if the container-toolkit is correctly installed and CUDA is available inside a container docker run --rm -t --gpus=all public.ecr.aws/docker/library/python:3.13 nvidia-smi 2025-12-04T09:24:09.3547602Z retry_wait_seconds: 10 2025-12-04T09:24:09.3547890Z polling_interval_seconds: 1 2025-12-04T09:24:09.3548179Z warning_on_retry: true 2025-12-04T09:24:09.3548456Z continue_on_error: false 2025-12-04T09:24:09.3548714Z env: 2025-12-04T09:24:09.3548923Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:09.3549196Z HAS_NVIDIA_GPU: true 2025-12-04T09:24:09.3549525Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:24:09.3549905Z DRIVER_VERSION: 580.82.07 2025-12-04T09:24:09.3550176Z ##[endgroup] 2025-12-04T09:24:09.4917769Z == Installing nvidia driver NVIDIA-Linux-x86_64-580.82.07.run == 2025-12-04T09:24:09.4918246Z + pre_install_nvidia_driver_amzn2 2025-12-04T09:24:09.4920966Z + sudo yum remove -y nvidia-driver-latest-dkms 2025-12-04T09:24:10.2884934Z No match for argument: nvidia-driver-latest-dkms 2025-12-04T09:24:10.2885440Z No packages marked for removal. 2025-12-04T09:24:10.2953311Z Dependencies resolved. 2025-12-04T09:24:10.2964237Z Nothing to do. 2025-12-04T09:24:10.2965177Z Complete! 
2025-12-04T09:24:10.3832777Z + install_nvidia_driver_common 2025-12-04T09:24:10.3838288Z + echo 'Before installing NVIDIA driver' 2025-12-04T09:24:10.3838623Z + lspci 2025-12-04T09:24:10.3840701Z Before installing NVIDIA driver 2025-12-04T09:24:10.4801747Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:24:10.4802457Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:24:10.4803021Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:24:10.4803552Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:24:10.4804031Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:24:10.4804562Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:24:10.4805047Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-12-04T09:24:10.4805555Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. 
NVMe SSD Controller 2025-12-04T09:24:10.4805955Z + lsmod 2025-12-04T09:24:10.4859125Z Module Size Used by 2025-12-04T09:24:10.4859756Z nvidia_uvm 1925120 0 2025-12-04T09:24:10.4860039Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:24:10.4860331Z drm 602112 1 nvidia 2025-12-04T09:24:10.4860723Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:24:10.4861094Z backlight 24576 1 drm 2025-12-04T09:24:10.4861378Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:24:10.4861668Z xt_conntrack 16384 1 2025-12-04T09:24:10.4861930Z nft_chain_nat 16384 3 2025-12-04T09:24:10.4862187Z xt_MASQUERADE 20480 1 2025-12-04T09:24:10.4862487Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:24:10.4862901Z nf_conntrack_netlink 57344 0 2025-12-04T09:24:10.4863459Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:24:10.4864327Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:24:10.4864794Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:24:10.4865234Z xfrm_user 57344 1 2025-12-04T09:24:10.4865650Z xfrm_algo 16384 1 xfrm_user 2025-12-04T09:24:10.4866108Z xt_addrtype 16384 2 2025-12-04T09:24:10.4866468Z nft_compat 20480 4 2025-12-04T09:24:10.4866909Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:24:10.4867499Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:24:10.4868054Z br_netfilter 36864 0 2025-12-04T09:24:10.4868441Z bridge 323584 1 br_netfilter 2025-12-04T09:24:10.4868872Z stp 16384 1 bridge 2025-12-04T09:24:10.4869281Z llc 16384 2 bridge,stp 2025-12-04T09:24:10.4869693Z overlay 167936 0 2025-12-04T09:24:10.4870092Z tls 139264 0 2025-12-04T09:24:10.4870494Z nls_ascii 16384 1 2025-12-04T09:24:10.4870916Z nls_cp437 20480 1 2025-12-04T09:24:10.4871327Z vfat 24576 1 2025-12-04T09:24:10.4871742Z fat 86016 1 vfat 2025-12-04T09:24:10.4872179Z sunrpc 700416 1 2025-12-04T09:24:10.4872601Z ghash_clmulni_intel 16384 0 2025-12-04T09:24:10.4873026Z i8042 45056 0 2025-12-04T09:24:10.4873437Z serio 28672 3 i8042 
2025-12-04T09:24:10.4873864Z ena 184320 0 2025-12-04T09:24:10.4874293Z button 24576 0 2025-12-04T09:24:10.4874711Z sch_fq_codel 20480 17 2025-12-04T09:24:10.4875085Z fuse 184320 1 2025-12-04T09:24:10.4875449Z loop 36864 0 2025-12-04T09:24:10.4875859Z dm_mod 188416 0 2025-12-04T09:24:10.4876285Z configfs 57344 1 2025-12-04T09:24:10.4876690Z dmi_sysfs 20480 0 2025-12-04T09:24:10.4877113Z crc32_pclmul 16384 0 2025-12-04T09:24:10.4877531Z crc32c_intel 24576 0 2025-12-04T09:24:10.4877935Z efivarfs 24576 1 2025-12-04T09:24:10.4878344Z + modinfo nvidia 2025-12-04T09:24:10.4879452Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:24:10.4880357Z import_ns: DMA_BUF 2025-12-04T09:24:10.4880751Z alias: char-major-195-* 2025-12-04T09:24:10.4881186Z version: 580.82.07 2025-12-04T09:24:10.4881588Z supported: external 2025-12-04T09:24:10.4881983Z license: Dual MIT/GPL 2025-12-04T09:24:10.4882446Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:24:10.4883000Z firmware: nvidia/580.82.07/gsp_ga10x.bin 2025-12-04T09:24:10.4883522Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:24:10.4884063Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:24:10.4884643Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:24:10.4885215Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:24:10.4885792Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:24:10.4886413Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:24:10.4887144Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:24:10.4887703Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:24:10.4888213Z depends: i2c-core,drm 2025-12-04T09:24:10.4888633Z retpoline: Y 2025-12-04T09:24:10.4888976Z name: nvidia 2025-12-04T09:24:10.4889554Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:24:10.4890327Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:24:10.4891061Z parm: 
NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:24:10.4891750Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:24:10.4892267Z parm: NVreg_RmLogonRC:int 2025-12-04T09:24:10.4892749Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:24:10.4893417Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:24:10.4893916Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:24:10.4894416Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:24:10.4895005Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:24:10.4895640Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:24:10.4896196Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:24:10.4896687Z parm: NVreg_EnableMSI:int 2025-12-04T09:24:10.4897179Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:24:10.4897771Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:24:10.4898422Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:24:10.4899034Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:24:10.4899707Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:24:10.4900619Z parm: NVreg_DynamicPowerManagement:int 2025-12-04T09:24:10.4901309Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:24:10.4901985Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:24:10.4902543Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:24:10.4903143Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:24:10.4903737Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:24:10.4904291Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:24:10.4904818Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:24:10.4905353Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:24:10.4905930Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:24:10.4906445Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:24:10.4906999Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:24:10.4907594Z parm: NVreg_RegisterPCIDriver:int 2025-12-04T09:24:10.4908173Z parm: NVreg_RegisterPlatformDeviceDriver:int 
2025-12-04T09:24:10.4908763Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:24:10.4909315Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:24:10.4909880Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:24:10.4910470Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:24:10.4911023Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:24:10.4911589Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:24:10.4912136Z parm: NVreg_RmMsg:charp 2025-12-04T09:24:10.4912604Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:24:10.4913135Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:24:10.4913666Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:24:10.4914177Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:24:10.4914721Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:24:10.4915306Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:24:10.4915934Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:24:10.4916476Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:24:10.4917042Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:24:10.4917602Z parm: rm_firmware_active:charp 2025-12-04T09:24:10.4918310Z + HAS_NVIDIA_DRIVER=0 2025-12-04T09:24:10.4918584Z ++ command -v nvidia-smi 2025-12-04T09:24:10.4918849Z + '[' -x /usr/bin/nvidia-smi ']' 2025-12-04T09:24:10.4919104Z + set +e 2025-12-04T09:24:10.4919564Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2025-12-04T09:24:12.2132311Z + INSTALLED_DRIVER_VERSION=580.82.07 2025-12-04T09:24:12.2132808Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:24:12.2133152Z + '[' 0 -ne 0 ']' 2025-12-04T09:24:12.2133385Z + '[' 580.82.07 '!=' 580.82.07 ']' 2025-12-04T09:24:12.2133648Z + HAS_NVIDIA_DRIVER=1 2025-12-04T09:24:12.2134082Z + echo 'NVIDIA driver (580.82.07) has already been installed. Skipping NVIDIA driver installation' 2025-12-04T09:24:12.2134552Z + set -e 2025-12-04T09:24:12.2134747Z + '[' 1 -eq 0 ']' 2025-12-04T09:24:12.2135504Z NVIDIA driver (580.82.07) has already been installed. 
Skipping NVIDIA driver installation 2025-12-04T09:24:12.2135965Z + post_install_nvidia_driver_common 2025-12-04T09:24:12.2140823Z + sudo modprobe nvidia 2025-12-04T09:24:12.3119847Z + echo 'After installing NVIDIA driver' 2025-12-04T09:24:12.3120149Z + lspci 2025-12-04T09:24:12.3120373Z After installing NVIDIA driver 2025-12-04T09:24:12.3243722Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:24:12.3244225Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:24:12.3244777Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:24:12.3245305Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:24:12.3245786Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:24:12.3246318Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:24:12.3246811Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-12-04T09:24:12.3247293Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. 
NVMe SSD Controller 2025-12-04T09:24:12.3247700Z + lsmod 2025-12-04T09:24:12.3284321Z Module Size Used by 2025-12-04T09:24:12.3284612Z nvidia_uvm 1925120 0 2025-12-04T09:24:12.3285064Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:24:12.3285411Z drm 602112 1 nvidia 2025-12-04T09:24:12.3285721Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:24:12.3286037Z backlight 24576 1 drm 2025-12-04T09:24:12.3286333Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:24:12.3286625Z xt_conntrack 16384 1 2025-12-04T09:24:12.3286993Z nft_chain_nat 16384 3 2025-12-04T09:24:12.3287294Z xt_MASQUERADE 20480 1 2025-12-04T09:24:12.3287591Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:24:12.3287933Z nf_conntrack_netlink 57344 0 2025-12-04T09:24:12.3288350Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:24:12.3288803Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:24:12.3289124Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:24:12.3289424Z xfrm_user 57344 1 2025-12-04T09:24:12.3289704Z xfrm_algo 16384 1 xfrm_user 2025-12-04T09:24:12.3289989Z xt_addrtype 16384 2 2025-12-04T09:24:12.3290259Z nft_compat 20480 4 2025-12-04T09:24:12.3290571Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:24:12.3290981Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:24:12.3291370Z br_netfilter 36864 0 2025-12-04T09:24:12.3291652Z bridge 323584 1 br_netfilter 2025-12-04T09:24:12.3291952Z stp 16384 1 bridge 2025-12-04T09:24:12.3292233Z llc 16384 2 bridge,stp 2025-12-04T09:24:12.3292523Z overlay 167936 0 2025-12-04T09:24:12.3292783Z tls 139264 0 2025-12-04T09:24:12.3293042Z nls_ascii 16384 1 2025-12-04T09:24:12.3293292Z nls_cp437 20480 1 2025-12-04T09:24:12.3293977Z vfat 24576 1 2025-12-04T09:24:12.3294249Z fat 86016 1 vfat 2025-12-04T09:24:12.3294514Z sunrpc 700416 1 2025-12-04T09:24:12.3294773Z ghash_clmulni_intel 16384 0 2025-12-04T09:24:12.3295163Z i8042 45056 0 2025-12-04T09:24:12.3295479Z serio 28672 3 i8042 
2025-12-04T09:24:12.3296096Z ena 184320 0 2025-12-04T09:24:12.3296406Z button 24576 0 2025-12-04T09:24:12.3296737Z sch_fq_codel 20480 17 2025-12-04T09:24:12.3297184Z fuse 184320 1 2025-12-04T09:24:12.3297488Z loop 36864 0 2025-12-04T09:24:12.3297817Z dm_mod 188416 0 2025-12-04T09:24:12.3298252Z configfs 57344 1 2025-12-04T09:24:12.3298720Z dmi_sysfs 20480 0 2025-12-04T09:24:12.3299081Z crc32_pclmul 16384 0 2025-12-04T09:24:12.3299534Z crc32c_intel 24576 0 2025-12-04T09:24:12.3299904Z efivarfs 24576 1 2025-12-04T09:24:12.3300195Z + modinfo nvidia 2025-12-04T09:24:12.3310856Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:24:12.3311492Z import_ns: DMA_BUF 2025-12-04T09:24:12.3311903Z alias: char-major-195-* 2025-12-04T09:24:12.3312222Z version: 580.82.07 2025-12-04T09:24:12.3312572Z supported: external 2025-12-04T09:24:12.3312959Z license: Dual MIT/GPL 2025-12-04T09:24:12.3313291Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:24:12.3313733Z firmware: nvidia/580.82.07/gsp_ga10x.bin 2025-12-04T09:24:12.3314233Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:24:12.3314673Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:24:12.3315081Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:24:12.3315602Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:24:12.3316055Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:24:12.3316450Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:24:12.3316940Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:24:12.3317377Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:24:12.3317810Z depends: i2c-core,drm 2025-12-04T09:24:12.3318125Z retpoline: Y 2025-12-04T09:24:12.3318450Z name: nvidia 2025-12-04T09:24:12.3318934Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:24:12.3319628Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:24:12.3320186Z parm: 
NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:24:12.3320741Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:24:12.3321171Z parm: NVreg_RmLogonRC:int 2025-12-04T09:24:12.3321530Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:24:12.3321969Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:24:12.3322390Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:24:12.3322747Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:24:12.3323225Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:24:12.3323727Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:24:12.3324110Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:24:12.3324531Z parm: NVreg_EnableMSI:int 2025-12-04T09:24:12.3325003Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:24:12.3325446Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:24:12.3325981Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:24:12.3326484Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:24:12.3326973Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:24:12.3327473Z parm: NVreg_DynamicPowerManagement:int 2025-12-04T09:24:12.3328006Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:24:12.3328660Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:24:12.3329179Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:24:12.3329598Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:24:12.3330051Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:24:12.3330603Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:24:12.3330977Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:24:12.3331415Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:24:12.3331879Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:24:12.3332271Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:24:12.3332698Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:24:12.3333196Z parm: NVreg_RegisterPCIDriver:int 2025-12-04T09:24:12.3333626Z parm: NVreg_RegisterPlatformDeviceDriver:int 
2025-12-04T09:24:12.3334194Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:24:12.3334682Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:24:12.3335143Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:24:12.3335600Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:24:12.3336140Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:24:12.3336596Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:24:12.3337016Z parm: NVreg_RmMsg:charp 2025-12-04T09:24:12.3337446Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:24:12.3337880Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:24:12.3338275Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:24:12.3338715Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:24:12.3339160Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:24:12.3339587Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:24:12.3340122Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:24:12.3340501Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:24:12.3340962Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:24:12.3341493Z parm: rm_firmware_active:charp 2025-12-04T09:24:12.3341838Z + set +e 2025-12-04T09:24:12.3342101Z + nvidia-smi 2025-12-04T09:24:13.7826071Z Thu Dec 4 09:24:13 2025 2025-12-04T09:24:13.7826633Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:13.7827430Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:24:13.7828008Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:24:13.7828617Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:24:13.7829334Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:24:13.7829854Z | | | MIG M. 
| 2025-12-04T09:24:13.7830304Z |=========================================+========================+======================| 2025-12-04T09:24:13.7914943Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 | 2025-12-04T09:24:13.7915483Z | 0% 27C P0 59W / 300W | 0MiB / 23028MiB | 4% Default | 2025-12-04T09:24:13.7915937Z | | | N/A | 2025-12-04T09:24:13.7916511Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:24:13.7916885Z 2025-12-04T09:24:13.7917123Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:13.7917713Z | Processes: | 2025-12-04T09:24:13.7918271Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:24:13.7918800Z | ID ID Usage | 2025-12-04T09:24:13.7919719Z |=========================================================================================| 2025-12-04T09:24:13.7920726Z | No running processes found | 2025-12-04T09:24:13.7921315Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:14.2155942Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2025-12-04T09:24:15.6667978Z NVIDIA A10G 2025-12-04T09:24:15.9390609Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:24:15.9390985Z + '[' 0 -eq 0 ']' 2025-12-04T09:24:15.9391453Z + echo 'INFO: Ignoring allowed status 0' 2025-12-04T09:24:15.9391829Z + set -e 2025-12-04T09:24:15.9392171Z INFO: Ignoring allowed status 0 2025-12-04T09:24:15.9401453Z == Installing nvidia container toolkit for amzn2023 == 2025-12-04T09:24:15.9405158Z + sudo yum install -y yum-utils 2025-12-04T09:24:16.4042933Z Last metadata expiration check: 0:08:46 ago on Thu Dec 4 09:15:30 2025. 2025-12-04T09:24:16.4322341Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed. 2025-12-04T09:24:16.4913381Z Dependencies resolved. 2025-12-04T09:24:16.5207752Z Nothing to do. 2025-12-04T09:24:16.5208154Z Complete! 
2025-12-04T09:24:16.6342330Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]] 2025-12-04T09:24:16.6343179Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:24:16.6344474Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:24:16.9427287Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:24:16.9949215Z + sudo yum install -y nvidia-container-toolkit-1.17.8 libnvidia-container-tools-1.17.8 libnvidia-container1-1.17.8 nvidia-container-toolkit-base-1.17.8 2025-12-04T09:24:17.5407257Z nvidia-container-toolkit 19 kB/s | 833 B 00:00 2025-12-04T09:24:17.6249852Z Dependencies resolved. 2025-12-04T09:24:17.6535699Z ================================================================================ 2025-12-04T09:24:17.6536237Z Package Arch Version Repository Size 2025-12-04T09:24:17.6536729Z ================================================================================ 2025-12-04T09:24:17.6537184Z Downgrading: 2025-12-04T09:24:17.6537659Z libnvidia-container-tools x86_64 1.17.8-1 nvidia-container-toolkit 40 k 2025-12-04T09:24:17.6538340Z libnvidia-container1 x86_64 1.17.8-1 nvidia-container-toolkit 1.0 M 2025-12-04T09:24:17.6539042Z nvidia-container-toolkit x86_64 1.17.8-1 nvidia-container-toolkit 1.2 M 2025-12-04T09:24:17.6539720Z nvidia-container-toolkit-base x86_64 1.17.8-1 nvidia-container-toolkit 5.8 M 2025-12-04T09:24:17.6540150Z 2025-12-04T09:24:17.6540275Z Transaction Summary 2025-12-04T09:24:17.6540679Z ================================================================================ 2025-12-04T09:24:17.6541151Z Downgrade 4 Packages 2025-12-04T09:24:17.6541357Z 2025-12-04T09:24:17.6541487Z Total download size: 8.0 M 2025-12-04T09:24:17.6541926Z Downloading Packages: 2025-12-04T09:24:17.6970242Z (1/4): libnvidia-container-tools-1.17.8-1.x86_6 972 kB/s | 40 kB 00:00 
2025-12-04T09:24:17.7518569Z (2/4): libnvidia-container1-1.17.8-1.x86_64.rpm 10 MB/s | 1.0 MB 00:00 2025-12-04T09:24:17.8075979Z (3/4): nvidia-container-toolkit-1.17.8-1.x86_64 8.1 MB/s | 1.2 MB 00:00 2025-12-04T09:24:17.9457594Z (4/4): nvidia-container-toolkit-base-1.17.8-1.x 23 MB/s | 5.8 MB 00:00 2025-12-04T09:24:17.9465194Z -------------------------------------------------------------------------------- 2025-12-04T09:24:17.9468214Z Total 27 MB/s | 8.0 MB 00:00 2025-12-04T09:24:17.9470824Z Running transaction check 2025-12-04T09:24:17.9588259Z Transaction check succeeded. 2025-12-04T09:24:17.9588646Z Running transaction test 2025-12-04T09:24:18.0090289Z Transaction test succeeded. 2025-12-04T09:24:18.0092701Z Running transaction 2025-12-04T09:24:18.8605034Z Preparing : 1/1 2025-12-04T09:24:18.9993658Z Downgrading : nvidia-container-toolkit-base-1.17.8-1.x86_64 1/8 2025-12-04T09:24:19.0261492Z Downgrading : libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:24:19.0980778Z Running scriptlet: libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:24:19.2288288Z Downgrading : libnvidia-container-tools-1.17.8-1.x86_64 3/8 2025-12-04T09:24:19.2569186Z Downgrading : nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:24:19.3447094Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:24:19.3526717Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:24:19.3527323Z Cleanup : nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:24:19.3865510Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:24:19.3937753Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:24:19.3938446Z Cleanup : libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:24:19.4314083Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:24:19.4395022Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:24:19.4395723Z Cleanup : 
libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:24:19.4797546Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:24:19.4866834Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:24:19.4867610Z Cleanup : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:24:19.5218171Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:24:19.5892004Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 8/8 2025-12-04T09:24:44.1239578Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:24:44.1241536Z Verifying : libnvidia-container-tools-1.17.8-1.x86_64 1/8 2025-12-04T09:24:44.1242134Z Verifying : libnvidia-container-tools-1.18.1-1.x86_64 2/8 2025-12-04T09:24:44.1242755Z Verifying : libnvidia-container1-1.17.8-1.x86_64 3/8 2025-12-04T09:24:44.1243356Z Verifying : libnvidia-container1-1.18.1-1.x86_64 4/8 2025-12-04T09:24:44.1243988Z Verifying : nvidia-container-toolkit-1.17.8-1.x86_64 5/8 2025-12-04T09:24:44.1244590Z Verifying : nvidia-container-toolkit-1.18.1-1.x86_64 6/8 2025-12-04T09:24:44.1245211Z Verifying : nvidia-container-toolkit-base-1.17.8-1.x86_64 7/8 2025-12-04T09:24:44.3132752Z Verifying : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8================================================================================ 2025-12-04T09:24:44.3133328Z WARNING: 2025-12-04T09:24:44.3133579Z A newer release of "Amazon Linux" is available. 
2025-12-04T09:24:44.3133812Z 2025-12-04T09:24:44.3133906Z Available Versions: 2025-12-04T09:24:44.3134067Z 2025-12-04T09:24:44.3134157Z Version 2023.9.20250929: 2025-12-04T09:24:44.3134467Z Run the following command to upgrade to 2023.9.20250929: 2025-12-04T09:24:44.3134718Z 2025-12-04T09:24:44.3134840Z dnf upgrade --releasever=2023.9.20250929 2025-12-04T09:24:44.3135060Z 2025-12-04T09:24:44.3135160Z Release notes: 2025-12-04T09:24:44.3135579Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20250929.html 2025-12-04T09:24:44.3135957Z 2025-12-04T09:24:44.3136358Z Version 2023.9.20251014: 2025-12-04T09:24:44.3136676Z Run the following command to upgrade to 2023.9.20251014: 2025-12-04T09:24:44.3136934Z 2025-12-04T09:24:44.3137056Z dnf upgrade --releasever=2023.9.20251014 2025-12-04T09:24:44.3137267Z 2025-12-04T09:24:44.3137366Z Release notes: 2025-12-04T09:24:44.3137770Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251014.html 2025-12-04T09:24:44.3138135Z 2025-12-04T09:24:44.3138228Z Version 2023.9.20251020: 2025-12-04T09:24:44.3138543Z Run the following command to upgrade to 2023.9.20251020: 2025-12-04T09:24:44.3138793Z 2025-12-04T09:24:44.3138918Z dnf upgrade --releasever=2023.9.20251020 2025-12-04T09:24:44.3139131Z 2025-12-04T09:24:44.3139217Z Release notes: 2025-12-04T09:24:44.3139795Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251020.html 2025-12-04T09:24:44.3140171Z 2025-12-04T09:24:44.3140263Z Version 2023.9.20251027: 2025-12-04T09:24:44.3140579Z Run the following command to upgrade to 2023.9.20251027: 2025-12-04T09:24:44.3140833Z 2025-12-04T09:24:44.3140951Z dnf upgrade --releasever=2023.9.20251027 2025-12-04T09:24:44.3141177Z 2025-12-04T09:24:44.3141263Z Release notes: 2025-12-04T09:24:44.3141662Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251027.html 2025-12-04T09:24:44.3142029Z 2025-12-04T09:24:44.3142130Z Version 2023.9.20251105: 
2025-12-04T09:24:44.3142437Z Run the following command to upgrade to 2023.9.20251105: 2025-12-04T09:24:44.3142697Z 2025-12-04T09:24:44.3142815Z dnf upgrade --releasever=2023.9.20251105 2025-12-04T09:24:44.3143024Z 2025-12-04T09:24:44.3143120Z Release notes: 2025-12-04T09:24:44.3143512Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251105.html 2025-12-04T09:24:44.3143898Z 2025-12-04T09:24:44.3144012Z Version 2023.9.20251110: 2025-12-04T09:24:44.3144350Z Run the following command to upgrade to 2023.9.20251110: 2025-12-04T09:24:44.3144607Z 2025-12-04T09:24:44.3144734Z dnf upgrade --releasever=2023.9.20251110 2025-12-04T09:24:44.3144943Z 2025-12-04T09:24:44.3145033Z Release notes: 2025-12-04T09:24:44.3145431Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251110.html 2025-12-04T09:24:44.3145804Z 2025-12-04T09:24:44.3145904Z Version 2023.9.20251117: 2025-12-04T09:24:44.3146212Z Run the following command to upgrade to 2023.9.20251117: 2025-12-04T09:24:44.3146467Z 2025-12-04T09:24:44.3146584Z dnf upgrade --releasever=2023.9.20251117 2025-12-04T09:24:44.3146803Z 2025-12-04T09:24:44.3146895Z Release notes: 2025-12-04T09:24:44.3147300Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251117.html 2025-12-04T09:24:44.3147668Z 2025-12-04T09:24:44.3147786Z ================================================================================ 2025-12-04T09:24:44.3716610Z 2025-12-04T09:24:44.3716851Z 2025-12-04T09:24:44.3717017Z Downgraded: 2025-12-04T09:24:44.3717727Z libnvidia-container-tools-1.17.8-1.x86_64 2025-12-04T09:24:44.3718801Z libnvidia-container1-1.17.8-1.x86_64 2025-12-04T09:24:44.3719991Z nvidia-container-toolkit-1.17.8-1.x86_64 2025-12-04T09:24:44.3721122Z nvidia-container-toolkit-base-1.17.8-1.x86_64 2025-12-04T09:24:44.3721791Z 2025-12-04T09:24:44.3721971Z Complete! 
2025-12-04T09:24:44.4478927Z + sudo systemctl restart docker 2025-12-04T09:24:50.9017931Z Thu Dec 4 09:24:50 2025 2025-12-04T09:24:50.9018342Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:50.9018843Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:24:50.9019371Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:24:50.9020126Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:24:50.9020671Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:24:50.9021114Z | | | MIG M. | 2025-12-04T09:24:50.9021448Z |=========================================+========================+======================| 2025-12-04T09:24:50.9113166Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-12-04T09:24:50.9113609Z | 0% 27C P0 60W / 300W | 0MiB / 23028MiB | 4% Default | 2025-12-04T09:24:50.9113996Z | | | N/A | 2025-12-04T09:24:50.9114598Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:24:50.9114918Z 2025-12-04T09:24:50.9115326Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:50.9115801Z | Processes: | 2025-12-04T09:24:50.9116257Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:24:50.9116669Z | ID ID Usage | 2025-12-04T09:24:50.9117015Z |=========================================================================================| 2025-12-04T09:24:50.9118776Z | No running processes found | 2025-12-04T09:24:50.9119304Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:24:51.0873648Z Unable to find image 'public.ecr.aws/docker/library/python:3.13' locally 2025-12-04T09:24:51.2939910Z 3.13: Pulling from docker/library/python 2025-12-04T09:24:51.3890348Z 53c88f1dfeb7: Pulling fs layer 
2025-12-04T09:24:51.3890694Z eae668646f44: Pulling fs layer 2025-12-04T09:24:51.3890968Z ff2e6e687b6c: Pulling fs layer 2025-12-04T09:24:51.3891229Z 7c40a3faff76: Pulling fs layer 2025-12-04T09:24:51.3891495Z 967a3b1c8fef: Pulling fs layer 2025-12-04T09:24:51.3891817Z a64e1a44f22a: Pulling fs layer 2025-12-04T09:24:51.3892165Z 52655f8a5bcc: Pulling fs layer 2025-12-04T09:24:51.3892539Z 7c40a3faff76: Waiting 2025-12-04T09:24:51.3892819Z 967a3b1c8fef: Waiting 2025-12-04T09:24:51.3893038Z a64e1a44f22a: Waiting 2025-12-04T09:24:51.3893262Z 52655f8a5bcc: Waiting 2025-12-04T09:24:51.4922868Z eae668646f44: Verifying Checksum 2025-12-04T09:24:51.4923314Z eae668646f44: Download complete 2025-12-04T09:24:51.5688094Z 53c88f1dfeb7: Verifying Checksum 2025-12-04T09:24:51.5688591Z 53c88f1dfeb7: Download complete 2025-12-04T09:24:51.6442474Z 967a3b1c8fef: Verifying Checksum 2025-12-04T09:24:51.6442976Z 967a3b1c8fef: Download complete 2025-12-04T09:24:51.7567041Z a64e1a44f22a: Download complete 2025-12-04T09:24:51.7879923Z 52655f8a5bcc: Verifying Checksum 2025-12-04T09:24:51.7880393Z 52655f8a5bcc: Download complete 2025-12-04T09:24:51.7939048Z ff2e6e687b6c: Verifying Checksum 2025-12-04T09:24:51.7939479Z ff2e6e687b6c: Download complete 2025-12-04T09:24:52.2446577Z 7c40a3faff76: Verifying Checksum 2025-12-04T09:24:52.2447050Z 7c40a3faff76: Download complete 2025-12-04T09:24:53.3905784Z 53c88f1dfeb7: Pull complete 2025-12-04T09:24:54.1248415Z eae668646f44: Pull complete 2025-12-04T09:24:56.6754242Z ff2e6e687b6c: Pull complete 2025-12-04T09:25:03.4484525Z 7c40a3faff76: Pull complete 2025-12-04T09:25:03.7404784Z 967a3b1c8fef: Pull complete 2025-12-04T09:25:04.5307536Z a64e1a44f22a: Pull complete 2025-12-04T09:25:04.5538418Z 52655f8a5bcc: Pull complete 2025-12-04T09:25:04.5673144Z Digest: sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 2025-12-04T09:25:04.5715339Z Status: Downloaded newer image for public.ecr.aws/docker/library/python:3.13 
2025-12-04T09:25:11.7014062Z Thu Dec 4 09:25:11 2025 2025-12-04T09:25:11.7014797Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:25:11.7015307Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:25:11.7015793Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:25:11.7016289Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:25:11.7016813Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:25:11.7017250Z | | | MIG M. | 2025-12-04T09:25:11.7017589Z |=========================================+========================+======================| 2025-12-04T09:25:11.7176260Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-12-04T09:25:11.7176716Z | 0% 24C P8 10W / 300W | 0MiB / 23028MiB | 0% Default | 2025-12-04T09:25:11.7177105Z | | | N/A | 2025-12-04T09:25:11.7177502Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:25:11.7181340Z 2025-12-04T09:25:11.7181730Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:25:11.7182168Z | Processes: | 2025-12-04T09:25:11.7182611Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:25:11.7183023Z | ID ID Usage | 2025-12-04T09:25:11.7183380Z |=========================================================================================| 2025-12-04T09:25:11.7188135Z | No running processes found | 2025-12-04T09:25:11.7188621Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:25:13.4724244Z Command completed after 1 attempt(s). 
2025-12-04T09:25:13.4827300Z Prepare all required actions 2025-12-04T09:25:13.4856608Z ##[group]Run ./.github/actions/get-workflow-job-id 2025-12-04T09:25:13.4856936Z with: 2025-12-04T09:25:13.4857498Z github-token: *** 2025-12-04T09:25:13.4857733Z env: 2025-12-04T09:25:13.4857940Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:13.4858198Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:13.4858498Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:13.4858837Z ##[endgroup] 2025-12-04T09:25:13.4874141Z ##[group]Run set -eux 2025-12-04T09:25:13.4874404Z set -eux 2025-12-04T09:25:13.4874828Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T09:25:13.4890134Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:13.4890518Z env: 2025-12-04T09:25:13.4890742Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:13.4891032Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:13.4891388Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:13.4891949Z GITHUB_TOKEN: *** 2025-12-04T09:25:13.4892169Z ##[endgroup] 2025-12-04T09:25:13.4946497Z + python3 .github/scripts/get_workflow_job_id.py 19922826259 i-0513695dee1ce902e 2025-12-04T09:25:15.2891979Z Setting output job-id=57118183207 2025-12-04T09:25:15.2892825Z Setting output job-name=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:15.3009902Z ##[group]Run python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:25:15.3010612Z python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:25:15.3011516Z python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 & 2025-12-04T09:25:15.3012310Z echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:25:15.3022351Z shell: 
/usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:15.3022707Z env: 2025-12-04T09:25:15.3022921Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:15.3023186Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:15.3023482Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:15.3023827Z JOB_ID: 57118183207 2025-12-04T09:25:15.3024516Z JOB_NAME: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:15.3025226Z WORKFLOW_NAME: periodic 2025-12-04T09:25:15.3025688Z WORKFLOW_RUN_ID: 19922826259 2025-12-04T09:25:15.3025963Z MONITOR_LOG_INTERVAL: 5 2025-12-04T09:25:15.3026225Z MONITOR_DATA_COLLECT_INTERVAL: 1 2025-12-04T09:25:15.3026509Z ##[endgroup] 2025-12-04T09:25:15.5931356Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:25:15.9992510Z Collecting psutil==5.9.8 2025-12-04T09:25:16.0150010Z Downloading psutil-5.9.8-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288 kB) 2025-12-04T09:25:16.0916974Z Collecting dataclasses_json==0.6.7 2025-12-04T09:25:16.0944473Z Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB) 2025-12-04T09:25:16.1240627Z Collecting nvidia-ml-py==11.525.84 2025-12-04T09:25:16.1269580Z Downloading nvidia_ml_py-11.525.84-py3-none-any.whl (34 kB) 2025-12-04T09:25:16.1601781Z Collecting typing-inspect<1,>=0.4.0 2025-12-04T09:25:16.1629765Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-12-04T09:25:16.2832518Z Collecting marshmallow<4.0.0,>=3.18.0 2025-12-04T09:25:16.2866056Z Downloading marshmallow-3.26.1-py3-none-any.whl (50 kB) 2025-12-04T09:25:16.3461762Z Collecting packaging>=17.0 2025-12-04T09:25:16.3490508Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-12-04T09:25:16.4048936Z Collecting typing-extensions>=3.7.4 2025-12-04T09:25:16.4080080Z Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB) 
2025-12-04T09:25:16.4282018Z Collecting mypy-extensions>=0.3.0 2025-12-04T09:25:16.4313134Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-12-04T09:25:16.5229471Z Installing collected packages: typing-extensions, packaging, mypy-extensions, typing-inspect, marshmallow, psutil, nvidia-ml-py, dataclasses-json 2025-12-04T09:25:16.7981092Z Successfully installed dataclasses-json-0.6.7 marshmallow-3.26.1 mypy-extensions-1.1.0 nvidia-ml-py-11.525.84 packaging-25.0 psutil-5.9.8 typing-extensions-4.15.0 typing-inspect-0.9.0 2025-12-04T09:25:16.9918711Z Prepare all required actions 2025-12-04T09:25:16.9919098Z Getting action download info 2025-12-04T09:25:17.1876520Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T09:25:17.5507223Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-12-04T09:25:17.9356028Z ##[group]Run ./.github/actions/download-build-artifacts 2025-12-04T09:25:17.9356405Z with: 2025-12-04T09:25:17.9356701Z name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:25:17.9357096Z s3-bucket: gha-artifacts 2025-12-04T09:25:17.9357431Z env: 2025-12-04T09:25:17.9357701Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:17.9357951Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:17.9358250Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:17.9358573Z ##[endgroup] 2025-12-04T09:25:17.9389123Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:25:17.9389484Z with: 2025-12-04T09:25:17.9389760Z name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:25:17.9390136Z s3-bucket: gha-artifacts 2025-12-04T09:25:17.9390399Z region: us-east-1 2025-12-04T09:25:17.9390613Z env: 2025-12-04T09:25:17.9390813Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:17.9391062Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:17.9391361Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:17.9391698Z 
##[endgroup] 2025-12-04T09:25:18.4171946Z (node:59444) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:25:18.4172511Z 2025-12-04T09:25:18.4172702Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:25:18.4173218Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:25:18.4173794Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:25:18.7160941Z Found 1 objects with prefix pytorch/pytorch/19922826259/linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck/ 2025-12-04T09:25:18.7162026Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:25:26.0614575Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:25:26.0620216Z Artifact download has finished successfully 2025-12-04T09:25:26.0999593Z ##[group]Run unzip -o artifacts.zip 2025-12-04T09:25:26.0999929Z unzip -o artifacts.zip 2025-12-04T09:25:26.1009181Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:26.1009536Z env: 2025-12-04T09:25:26.1009751Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:26.1010013Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:26.1010313Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:26.1010665Z ##[endgroup] 2025-12-04T09:25:26.1086982Z Archive: artifacts.zip 2025-12-04T09:25:26.1088713Z creating: dist/ 2025-12-04T09:25:28.1720969Z inflating: dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:25:28.1857619Z inflating: dist/.ninja_log 2025-12-04T09:25:28.1858200Z creating: build/custom_test_artifacts/ 2025-12-04T09:25:28.1858842Z creating: build/custom_test_artifacts/custom-op-build/ 2025-12-04T09:25:28.1859517Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2025-12-04T09:25:28.1860420Z creating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:25:28.1869394Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:25:28.1870251Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/ 2025-12-04T09:25:28.1871106Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:25:28.1872028Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:25:28.1873119Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:25:28.1875573Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:25:28.1877296Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:25:28.1878689Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:25:28.1879752Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:25:28.1880684Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:25:28.1883681Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:25:28.1885458Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:25:28.1887052Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:25:28.1889543Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:25:28.1892227Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:25:28.1893287Z creating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:25:28.1894079Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:25:28.1954371Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:25:28.2015732Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:25:28.2017294Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:25:28.2082853Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:25:28.2084424Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:25:28.2085707Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:25:28.2087036Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:25:28.2088487Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:25:28.2090018Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:25:28.2091388Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:25:28.2092729Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:25:28.2094323Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:25:28.2095529Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:25:28.2096704Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:25:28.2098103Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:25:28.2099581Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:25:28.2101318Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:25:28.2104629Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:25:28.2180154Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:25:28.2181204Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:25:28.2257497Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:25:28.2258512Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:25:28.2259260Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:25:28.2260103Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2025-12-04T09:25:28.2260911Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2025-12-04T09:25:28.2261891Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2025-12-04T09:25:28.2262960Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2025-12-04T09:25:28.2264008Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2025-12-04T09:25:28.2264971Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2025-12-04T09:25:28.2266006Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2025-12-04T09:25:28.2267191Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2025-12-04T09:25:28.2268383Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2025-12-04T09:25:28.2269407Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2025-12-04T09:25:28.2270845Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2025-12-04T09:25:28.2293115Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2025-12-04T09:25:28.2497631Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2025-12-04T09:25:28.2498569Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2025-12-04T09:25:28.2499615Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2025-12-04T09:25:28.2500936Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2025-12-04T09:25:28.2502066Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2025-12-04T09:25:28.2503083Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2025-12-04T09:25:28.2504148Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2025-12-04T09:25:28.2505163Z 
inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2025-12-04T09:25:28.2506534Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2025-12-04T09:25:28.2507621Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2025-12-04T09:25:28.2509086Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2025-12-04T09:25:28.2531403Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2025-12-04T09:25:28.2615317Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2025-12-04T09:25:28.2616687Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:25:28.2617752Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:25:28.2618956Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2025-12-04T09:25:28.2620251Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2025-12-04T09:25:28.2622689Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2025-12-04T09:25:28.2623543Z inflating: build/custom_test_artifacts/custom-op-build/detect_cuda_version.cc 2025-12-04T09:25:28.2627513Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2025-12-04T09:25:28.2628638Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2025-12-04T09:25:28.2629958Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2025-12-04T09:25:28.2806192Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2025-12-04T09:25:28.2864719Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2025-12-04T09:25:28.2865764Z creating: 
build/custom_test_artifacts/jit-hook-build/ 2025-12-04T09:25:28.2866307Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2025-12-04T09:25:28.2866831Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:25:28.2874119Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:25:28.2874956Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/ 2025-12-04T09:25:28.2875816Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:25:28.2876730Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:25:28.2877585Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:25:28.2880537Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:25:28.2882182Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:25:28.2883577Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:25:28.2884521Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:25:28.2885429Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:25:28.2888435Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:25:28.2890215Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:25:28.2891815Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:25:28.2894361Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:25:28.2896813Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:25:28.2897847Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:25:28.2898631Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:25:28.2959096Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:25:28.3020592Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:25:28.3021905Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:25:28.3087533Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:25:28.3088893Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:25:28.3090069Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:25:28.3091555Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:25:28.3092893Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:25:28.3094101Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:25:28.3095486Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:25:28.3096977Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 
2025-12-04T09:25:28.3098753Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:25:28.3099895Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:25:28.3101296Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:25:28.3102775Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:25:28.3104255Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:25:28.3106591Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:25:28.3109700Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:25:28.3184545Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:25:28.3186253Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:25:28.3261968Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:25:28.3262971Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:25:28.3263710Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:25:28.3264544Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2025-12-04T09:25:28.3265406Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2025-12-04T09:25:28.3266413Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2025-12-04T09:25:28.3267562Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2025-12-04T09:25:28.3268606Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2025-12-04T09:25:28.3269585Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2025-12-04T09:25:28.3270616Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2025-12-04T09:25:28.3271945Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2025-12-04T09:25:28.3273168Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2025-12-04T09:25:28.3274228Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2025-12-04T09:25:28.3276570Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2025-12-04T09:25:28.3298799Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2025-12-04T09:25:28.3363850Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2025-12-04T09:25:28.3365164Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:25:28.3366489Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:25:28.3367424Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2025-12-04T09:25:28.3368486Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2025-12-04T09:25:28.3370851Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2025-12-04T09:25:28.3371710Z inflating: build/custom_test_artifacts/jit-hook-build/detect_cuda_version.cc 2025-12-04T09:25:28.3375089Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2025-12-04T09:25:28.3376292Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2025-12-04T09:25:28.3377549Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2025-12-04T09:25:28.3417786Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2025-12-04T09:25:28.3418485Z creating: build/custom_test_artifacts/custom-backend-build/ 2025-12-04T09:25:28.3419173Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2025-12-04T09:25:28.3419952Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:25:28.3428264Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:25:28.3429163Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/ 2025-12-04T09:25:28.3430119Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:25:28.3431290Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:25:28.3443151Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:25:28.3444530Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:25:28.3445676Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:25:28.3446428Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:25:28.3447262Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:25:28.3448273Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:25:28.3449383Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:25:28.3450222Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:25:28.3450991Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:25:28.3451821Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:25:28.3452707Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:25:28.3453496Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:25:28.3454208Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:25:28.3510570Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:25:28.3571598Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:25:28.3572950Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:25:28.3638807Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:25:28.3640304Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:25:28.3641534Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:25:28.3642896Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 
2025-12-04T09:25:28.3644426Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:25:28.3645868Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:25:28.3647365Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:25:28.3648649Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:25:28.3650503Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:25:28.3652208Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:25:28.3653418Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:25:28.3654722Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:25:28.3656221Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:25:28.3657629Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:25:28.3661041Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:25:28.3736881Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:25:28.3737989Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:25:28.3813864Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:25:28.3814967Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:25:28.3815808Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:25:28.3816681Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2025-12-04T09:25:28.3817544Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2025-12-04T09:25:28.3818563Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2025-12-04T09:25:28.3819789Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2025-12-04T09:25:28.3820964Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2025-12-04T09:25:28.3822233Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2025-12-04T09:25:28.3823386Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2025-12-04T09:25:28.3824649Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2025-12-04T09:25:28.3825813Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2025-12-04T09:25:28.3826889Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2025-12-04T09:25:28.3828376Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2025-12-04T09:25:28.3833756Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2025-12-04T09:25:28.3957356Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2025-12-04T09:25:28.3959733Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2025-12-04T09:25:28.3962010Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2025-12-04T09:25:28.3964523Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2025-12-04T09:25:28.3966140Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2025-12-04T09:25:28.3967183Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2025-12-04T09:25:28.3968298Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2025-12-04T09:25:28.3969152Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2025-12-04T09:25:28.3970010Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2025-12-04T09:25:28.3971060Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2025-12-04T09:25:28.3971906Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2025-12-04T09:25:28.3989857Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2025-12-04T09:25:28.4046793Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2025-12-04T09:25:28.4048208Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:25:28.4049336Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:25:28.4050332Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2025-12-04T09:25:28.4051519Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2025-12-04T09:25:28.4054010Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2025-12-04T09:25:28.4054930Z inflating: build/custom_test_artifacts/custom-backend-build/detect_cuda_version.cc 2025-12-04T09:25:28.4058296Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2025-12-04T09:25:28.4059481Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2025-12-04T09:25:28.4060718Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2025-12-04T09:25:28.4165068Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2025-12-04T09:25:28.4205935Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2025-12-04T09:25:28.4206566Z creating: build/lib/ 2025-12-04T09:25:28.4289842Z inflating: build/lib/libprotobuf-lite.a 2025-12-04T09:25:28.4736215Z inflating: build/lib/libprotobuf.a 2025-12-04T09:25:28.5234457Z inflating: build/lib/libprotoc.a 2025-12-04T09:25:28.5244704Z inflating: build/lib/libpthreadpool.a 2025-12-04T09:25:28.5253454Z inflating: build/lib/libcpuinfo.a 2025-12-04T09:25:28.5261844Z inflating: build/lib/libcpuinfo_internals.a 2025-12-04T09:25:28.5262956Z inflating: build/lib/libclog.a 2025-12-04T09:25:28.5282893Z inflating: build/lib/libpytorch_qnnpack.a 2025-12-04T09:25:28.5286107Z inflating: build/lib/libnnpack_reference_layers.a 2025-12-04T09:25:28.5304897Z inflating: build/lib/libnnpack.a 2025-12-04T09:25:28.5492985Z inflating: build/lib/libmicrokernels-prod.a 2025-12-04T09:25:28.6393326Z inflating: build/lib/libmicrokernels-all.a 2025-12-04T09:25:28.6465116Z inflating: build/lib/libgtest.a 
2025-12-04T09:25:28.6483201Z inflating: build/lib/libgmock.a 2025-12-04T09:25:28.6484190Z inflating: build/lib/libgtest_main.a 2025-12-04T09:25:28.6486050Z inflating: build/lib/libgmock_main.a 2025-12-04T09:25:28.6579002Z inflating: build/lib/libXNNPACK.a 2025-12-04T09:25:28.6656390Z inflating: build/lib/libbenchmark.a 2025-12-04T09:25:28.6657397Z inflating: build/lib/libbenchmark_main.a 2025-12-04T09:25:28.6659147Z inflating: build/lib/libjitprofiling.a 2025-12-04T09:25:28.6726470Z inflating: build/lib/libasmjit.a 2025-12-04T09:25:28.6735016Z inflating: build/lib/libittnotify.a 2025-12-04T09:25:28.7934921Z inflating: build/lib/libfbgemm.a 2025-12-04T09:25:28.7966298Z inflating: build/lib/libtensorpipe_uv.a 2025-12-04T09:25:28.8517395Z inflating: build/lib/libtensorpipe.a 2025-12-04T09:25:28.8764838Z inflating: build/lib/libtensorpipe_cuda.a 2025-12-04T09:25:28.8900601Z inflating: build/lib/libgloo.a 2025-12-04T09:25:28.8949470Z inflating: build/lib/libonnx_proto.a 2025-12-04T09:25:28.9396837Z inflating: build/lib/libgloo_cuda.a 2025-12-04T09:25:29.0119225Z inflating: build/lib/libonnx.a 2025-12-04T09:25:30.0434496Z inflating: build/lib/libdnnl.a 2025-12-04T09:25:30.0453942Z inflating: build/lib/libfmt.a 2025-12-04T09:25:30.0935429Z inflating: build/lib/libkineto.a 2025-12-04T09:25:30.1052575Z inflating: build/lib/libc10.so 2025-12-04T09:25:30.1102515Z inflating: build/lib/libc10_cuda.so 2025-12-04T09:25:30.1104873Z inflating: build/lib/libcaffe2_nvrtc.so 2025-12-04T09:25:30.1106829Z inflating: build/lib/libtorch_global_deps.so 2025-12-04T09:25:33.2388069Z inflating: build/lib/libtorch_cpu.so 2025-12-04T09:25:33.3160518Z inflating: build/lib/libtorch_nvshmem.so 2025-12-04T09:25:35.2984745Z inflating: build/lib/libtorch_cuda.so 2025-12-04T09:25:35.2986645Z inflating: build/lib/libtorch.so 2025-12-04T09:25:35.3039341Z inflating: build/lib/libtorch_cuda_linalg.so 2025-12-04T09:25:35.3111951Z inflating: build/lib/libtorchbind_test.so 2025-12-04T09:25:35.3131437Z 
inflating: build/lib/libjitbackend_test.so 2025-12-04T09:25:35.3155665Z inflating: build/lib/libbackend_with_compiler.so 2025-12-04T09:25:35.3182761Z inflating: build/lib/libaoti_custom_ops.so 2025-12-04T09:25:35.3186055Z inflating: build/lib/libc10d_cuda_test.so 2025-12-04T09:25:35.3191661Z inflating: build/lib/libshm.so 2025-12-04T09:25:35.5596997Z inflating: build/lib/libtorch_python.so 2025-12-04T09:25:35.5633875Z inflating: build/lib/libnnapi_backend.so 2025-12-04T09:25:35.5634199Z creating: build/bin/ 2025-12-04T09:25:35.6094532Z inflating: build/bin/protoc-3.13.0.0 2025-12-04T09:25:35.6553826Z inflating: build/bin/protoc 2025-12-04T09:25:35.6613942Z inflating: build/bin/c10_AllocatorConfig_test 2025-12-04T09:25:35.6670066Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2025-12-04T09:25:35.6728141Z inflating: build/bin/c10_DeviceGuard_test 2025-12-04T09:25:35.6786157Z inflating: build/bin/c10_Device_test 2025-12-04T09:25:35.6852711Z inflating: build/bin/c10_DispatchKeySet_test 2025-12-04T09:25:35.6908401Z inflating: build/bin/c10_StreamGuard_test 2025-12-04T09:25:35.6968652Z inflating: build/bin/c10_Scalar_test 2025-12-04T09:25:35.7031375Z inflating: build/bin/c10_SizesAndStrides_test 2025-12-04T09:25:35.7092503Z inflating: build/bin/c10_InlineDeviceGuard_test 2025-12-04T09:25:35.7155654Z inflating: build/bin/c10_SymInt_test 2025-12-04T09:25:35.7218775Z inflating: build/bin/c10_InlineStreamGuard_test 2025-12-04T09:25:35.7274040Z inflating: build/bin/c10_ArrayRef_test 2025-12-04T09:25:35.7352260Z inflating: build/bin/c10_cow_test 2025-12-04T09:25:35.7407643Z inflating: build/bin/c10_ConstexprCrc_test 2025-12-04T09:25:35.7463967Z inflating: build/bin/c10_DeadlockDetection_test 2025-12-04T09:25:35.7522996Z inflating: build/bin/c10_Bitset_test 2025-12-04T09:25:35.7586662Z inflating: build/bin/c10_Enumerate_test 2025-12-04T09:25:35.7646267Z inflating: build/bin/c10_IntrusiveList_test 2025-12-04T09:25:35.7703794Z inflating: build/bin/c10_Half_test 
2025-12-04T09:25:35.7765764Z inflating: build/bin/c10_LeftRight_test 2025-12-04T09:25:35.7825517Z inflating: build/bin/c10_NetworkFlow_test 2025-12-04T09:25:35.7881459Z inflating: build/bin/c10_Semaphore_test 2025-12-04T09:25:35.7937928Z inflating: build/bin/c10_Synchronized_test 2025-12-04T09:25:35.7999951Z inflating: build/bin/c10_ThreadLocal_test 2025-12-04T09:25:35.8058859Z inflating: build/bin/c10_TypeIndex_test 2025-12-04T09:25:35.8116622Z inflating: build/bin/c10_accumulate_test 2025-12-04T09:25:35.8179470Z inflating: build/bin/c10_bfloat16_test 2025-12-04T09:25:35.8236615Z inflating: build/bin/c10_bit_cast_test 2025-12-04T09:25:35.8298240Z inflating: build/bin/c10_complex_test 2025-12-04T09:25:35.8362134Z inflating: build/bin/c10_complex_math_test 2025-12-04T09:25:35.8418113Z inflating: build/bin/c10_error_test 2025-12-04T09:25:35.8476713Z inflating: build/bin/c10_exception_test 2025-12-04T09:25:35.8533608Z inflating: build/bin/c10_flags_test 2025-12-04T09:25:35.8590865Z inflating: build/bin/c10_generic_math_test 2025-12-04T09:25:35.8647963Z inflating: build/bin/c10_irange_test 2025-12-04T09:25:35.8707755Z inflating: build/bin/c10_lazy_test 2025-12-04T09:25:35.8877329Z inflating: build/bin/c10_intrusive_ptr_test 2025-12-04T09:25:35.8941138Z inflating: build/bin/c10_logging_test 2025-12-04T09:25:35.8997223Z inflating: build/bin/c10_nofatal_test 2025-12-04T09:25:35.9080007Z inflating: build/bin/c10_optional_test 2025-12-04T09:25:35.9139706Z inflating: build/bin/c10_registry_test 2025-12-04T09:25:35.9208487Z inflating: build/bin/c10_ordered_preserving_dict_test 2025-12-04T09:25:35.9373686Z inflating: build/bin/c10_small_vector_test 2025-12-04T09:25:35.9436752Z inflating: build/bin/c10_string_util_test 2025-12-04T09:25:35.9494689Z inflating: build/bin/c10_ssize_test 2025-12-04T09:25:35.9551102Z inflating: build/bin/c10_tempfile_test 2025-12-04T09:25:35.9606798Z inflating: build/bin/c10_string_view_test 2025-12-04T09:25:35.9669950Z inflating: 
build/bin/c10_typeid_test 2025-12-04T09:25:35.9719448Z inflating: build/bin/c10_intrusive_ptr_benchmark 2025-12-04T09:25:35.9779345Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2025-12-04T09:25:35.9838697Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_stream 2025-12-04T09:25:35.9897279Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_from_2_processes 2025-12-04T09:25:35.9956634Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_1_var_test 2025-12-04T09:25:36.0012822Z inflating: build/bin/c10_cuda_CUDATest 2025-12-04T09:25:36.0072354Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T09:25:36.0131750Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T09:25:36.0190632Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2025-12-04T09:25:36.0807831Z inflating: build/bin/vec_test_all_types_DEFAULT 2025-12-04T09:25:36.1442547Z inflating: build/bin/vec_test_all_types_AVX512 2025-12-04T09:25:36.2086953Z inflating: build/bin/vec_test_all_types_AVX2 2025-12-04T09:25:36.2193364Z inflating: build/bin/test_aoti_abi_check 2025-12-04T09:25:36.2249325Z inflating: build/bin/test_vec_half_DEFAULT 2025-12-04T09:25:36.2305527Z inflating: build/bin/test_vec_half_AVX512 2025-12-04T09:25:36.2361728Z inflating: build/bin/test_vec_half_AVX2 2025-12-04T09:25:36.2442818Z inflating: build/bin/Dict_test 2025-12-04T09:25:36.2501891Z inflating: build/bin/Dimname_test 2025-12-04T09:25:36.2574072Z inflating: build/bin/MaybeOwned_test 2025-12-04T09:25:36.2637645Z inflating: build/bin/NamedTensor_test 2025-12-04T09:25:36.2703824Z inflating: build/bin/apply_utils_test 2025-12-04T09:25:36.2768794Z inflating: build/bin/atest 2025-12-04T09:25:36.2839848Z inflating: build/bin/basic 2025-12-04T09:25:36.2901376Z inflating: build/bin/broadcast_test 2025-12-04T09:25:36.2958079Z inflating: build/bin/cpu_allocator_test 
2025-12-04T09:25:36.3023684Z inflating: build/bin/cpu_generator_test 2025-12-04T09:25:36.3082273Z inflating: build/bin/cpu_profiling_allocator_test 2025-12-04T09:25:36.3183338Z inflating: build/bin/cpu_rng_test 2025-12-04T09:25:36.3240638Z inflating: build/bin/dlconvertor_test 2025-12-04T09:25:36.3304666Z inflating: build/bin/extension_backend_test 2025-12-04T09:25:36.3366590Z inflating: build/bin/half_test 2025-12-04T09:25:36.3472247Z inflating: build/bin/ivalue_test 2025-12-04T09:25:36.3528269Z inflating: build/bin/lazy_tensor_test 2025-12-04T09:25:36.3587471Z inflating: build/bin/math_kernel_test 2025-12-04T09:25:36.3647189Z inflating: build/bin/memory_format_test 2025-12-04T09:25:36.3707136Z inflating: build/bin/memory_overlapping_test 2025-12-04T09:25:36.3766683Z inflating: build/bin/mobile_memory_cleanup 2025-12-04T09:25:36.3829836Z inflating: build/bin/native_test 2025-12-04T09:25:36.3886382Z inflating: build/bin/operator_name_test 2025-12-04T09:25:36.3943627Z inflating: build/bin/operators_test 2025-12-04T09:25:36.4002275Z inflating: build/bin/packedtensoraccessor_test 2025-12-04T09:25:36.4076975Z inflating: build/bin/pow_test 2025-12-04T09:25:36.4140095Z inflating: build/bin/quantized_test 2025-12-04T09:25:36.4196459Z inflating: build/bin/reduce_ops_test 2025-12-04T09:25:36.4253862Z inflating: build/bin/reportMemoryUsage_test 2025-12-04T09:25:36.4316472Z inflating: build/bin/scalar_tensor_test 2025-12-04T09:25:36.4380050Z inflating: build/bin/scalar_test 2025-12-04T09:25:36.4438019Z inflating: build/bin/StorageUtils_test 2025-12-04T09:25:36.4496384Z inflating: build/bin/stride_properties_test 2025-12-04T09:25:36.4583580Z inflating: build/bin/tensor_iterator_test 2025-12-04T09:25:36.4644374Z inflating: build/bin/test_parallel 2025-12-04T09:25:36.4700926Z inflating: build/bin/thread_init_test 2025-12-04T09:25:36.4762710Z inflating: build/bin/type_ptr_test 2025-12-04T09:25:36.4828668Z inflating: build/bin/type_test 2025-12-04T09:25:36.4887346Z inflating: 
build/bin/undefined_tensor_test 2025-12-04T09:25:36.4943463Z inflating: build/bin/verify_api_visibility 2025-12-04T09:25:36.5021589Z inflating: build/bin/legacy_vmap_test 2025-12-04T09:25:36.5078814Z inflating: build/bin/weakref_test 2025-12-04T09:25:36.5136787Z inflating: build/bin/wrapdim_test 2025-12-04T09:25:36.5194503Z inflating: build/bin/xla_tensor_test 2025-12-04T09:25:36.5260585Z inflating: build/bin/IListRef_test 2025-12-04T09:25:36.5375503Z inflating: build/bin/List_test 2025-12-04T09:25:36.5448728Z inflating: build/bin/KernelFunction_test 2025-12-04T09:25:36.5578871Z inflating: build/bin/kernel_function_legacy_test 2025-12-04T09:25:36.5683052Z inflating: build/bin/kernel_function_test 2025-12-04T09:25:36.5820268Z inflating: build/bin/kernel_lambda_legacy_test 2025-12-04T09:25:36.5930631Z inflating: build/bin/kernel_lambda_test 2025-12-04T09:25:36.5997075Z inflating: build/bin/kernel_stackbased_test 2025-12-04T09:25:36.6101110Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2025-12-04T09:25:36.6158516Z inflating: build/bin/CppSignature_test 2025-12-04T09:25:36.6220294Z inflating: build/bin/backend_fallback_test 2025-12-04T09:25:36.6275822Z inflating: build/bin/op_allowlist_test 2025-12-04T09:25:36.6605509Z inflating: build/bin/op_registration_test 2025-12-04T09:25:36.6678639Z inflating: build/bin/inline_container_test 2025-12-04T09:25:36.6738796Z inflating: build/bin/cuda_allocator_test 2025-12-04T09:25:36.6798001Z inflating: build/bin/cuda_apply_test 2025-12-04T09:25:36.6865511Z inflating: build/bin/cuda_atomic_ops_test 2025-12-04T09:25:36.6928490Z inflating: build/bin/cuda_caching_host_allocator_test 2025-12-04T09:25:36.7006077Z inflating: build/bin/cuda_complex_math_test 2025-12-04T09:25:36.7072232Z inflating: build/bin/cuda_complex_test 2025-12-04T09:25:36.7141924Z inflating: build/bin/cuda_cub_test 2025-12-04T09:25:36.7201590Z inflating: build/bin/cuda_cublas_handle_pool_test 2025-12-04T09:25:36.7257452Z inflating: 
build/bin/cuda_device_test 2025-12-04T09:25:36.7329022Z inflating: build/bin/cuda_distributions_test 2025-12-04T09:25:36.7388917Z inflating: build/bin/cuda_event_test 2025-12-04T09:25:36.7447161Z inflating: build/bin/cuda_dlconvertor_test 2025-12-04T09:25:36.7503425Z inflating: build/bin/cuda_exchange_device_test 2025-12-04T09:25:36.7562013Z inflating: build/bin/cuda_reportMemoryUsage_test 2025-12-04T09:25:36.7618315Z inflating: build/bin/cuda_allocatorTraceTracker_test 2025-12-04T09:25:36.7676200Z inflating: build/bin/cuda_integer_divider_test 2025-12-04T09:25:36.7744326Z inflating: build/bin/cuda_stream_test 2025-12-04T09:25:36.7800054Z inflating: build/bin/cuda_cudnn_test 2025-12-04T09:25:36.7856491Z inflating: build/bin/cuda_half_test 2025-12-04T09:25:36.7919870Z inflating: build/bin/cuda_generator_test 2025-12-04T09:25:36.7975875Z inflating: build/bin/cuda_optional_test 2025-12-04T09:25:36.8034402Z inflating: build/bin/cuda_packedtensoraccessor_test 2025-12-04T09:25:36.8093460Z inflating: build/bin/cuda_vectorized_test 2025-12-04T09:25:36.9231596Z inflating: build/bin/test_jit 2025-12-04T09:25:36.9290516Z inflating: build/bin/BackoffTest 2025-12-04T09:25:36.9350325Z inflating: build/bin/FileStoreTest 2025-12-04T09:25:36.9718822Z inflating: build/bin/test_lazy 2025-12-04T09:25:36.9782057Z inflating: build/bin/TCPStoreTest 2025-12-04T09:25:36.9842451Z inflating: build/bin/HashStoreTest 2025-12-04T09:25:36.9856973Z inflating: build/bin/ProcessGroupMPITest 2025-12-04T09:25:36.9860562Z inflating: build/bin/example_allreduce 2025-12-04T09:25:36.9935195Z inflating: build/bin/ProcessGroupGlooTest 2025-12-04T09:25:36.9998487Z inflating: build/bin/ProcessGroupGlooAsyncTest 2025-12-04T09:25:37.0069559Z inflating: build/bin/ProcessGroupNCCLTest 2025-12-04T09:25:37.0138271Z inflating: build/bin/ProcessGroupNCCLErrorsTest 2025-12-04T09:25:37.0199828Z inflating: build/bin/test_dist_autograd 2025-12-04T09:25:37.0275764Z inflating: build/bin/test_cpp_rpc 
2025-12-04T09:25:37.0278271Z inflating: build/bin/parallel_benchmark 2025-12-04T09:25:37.1498890Z inflating: build/bin/test_api 2025-12-04T09:25:37.1503577Z inflating: build/bin/torch_shm_manager 2025-12-04T09:25:37.1504075Z creating: .additional_ci_files/ 2025-12-04T09:25:37.1570002Z inflating: .additional_ci_files/test-times.json 2025-12-04T09:25:37.1808195Z inflating: .additional_ci_files/test-class-times.json 2025-12-04T09:25:37.1860744Z ##[group]Run rm artifacts.zip 2025-12-04T09:25:37.1861029Z rm artifacts.zip 2025-12-04T09:25:37.1870613Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:37.1870979Z env: 2025-12-04T09:25:37.1871420Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:37.1871702Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:37.1872024Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:37.1872365Z ##[endgroup] 2025-12-04T09:25:37.4694669Z ##[group]Run df -H 2025-12-04T09:25:37.4694925Z df -H 2025-12-04T09:25:37.4704895Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:37.4705297Z env: 2025-12-04T09:25:37.4705521Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:37.4705785Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:37.4706086Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:37.4706425Z ##[endgroup] 2025-12-04T09:25:37.4759435Z Filesystem Size Used Avail Use% Mounted on 2025-12-04T09:25:37.4759904Z devtmpfs 4.2M 0 4.2M 0% /dev 2025-12-04T09:25:37.4760213Z tmpfs 34G 0 34G 0% /dev/shm 2025-12-04T09:25:37.4760528Z tmpfs 14G 562k 14G 1% /run 2025-12-04T09:25:37.4760830Z /dev/nvme0n1p1 161G 54G 108G 34% / 2025-12-04T09:25:37.4761137Z tmpfs 34G 17k 34G 1% /tmp 2025-12-04T09:25:37.4761469Z /dev/nvme0n1p128 11M 1.4M 9.2M 13% /boot/efi 2025-12-04T09:25:37.4761802Z tmpfs 6.7G 0 6.7G 0% /run/user/0 2025-12-04T09:25:37.4797329Z Prepare all required actions 2025-12-04T09:25:37.4798431Z Getting action download info 2025-12-04T09:25:37.6613598Z ##[group]Run 
./.github/actions/download-td-artifacts 2025-12-04T09:25:37.6613932Z with: 2025-12-04T09:25:37.6614125Z env: 2025-12-04T09:25:37.6614324Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:37.6614581Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:37.6614882Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:37.6615215Z ##[endgroup] 2025-12-04T09:25:37.6734606Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:25:37.6734931Z with: 2025-12-04T09:25:37.6735132Z name: td_results 2025-12-04T09:25:37.6735370Z s3-bucket: gha-artifacts 2025-12-04T09:25:37.6735625Z region: us-east-1 2025-12-04T09:25:37.6735832Z env: 2025-12-04T09:25:37.6736037Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:37.6736463Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:37.6736764Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:37.6737315Z ##[endgroup] 2025-12-04T09:25:38.2339817Z (node:59468) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:25:38.2340360Z 2025-12-04T09:25:38.2340550Z Please migrate your code to use AWS SDK for JavaScript (v3). 
2025-12-04T09:25:38.2341069Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:25:38.2341597Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:25:38.3307643Z Found 1 objects with prefix pytorch/pytorch/19922826259/td_results/ 2025-12-04T09:25:38.3308260Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:25:38.3861998Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:25:38.3867338Z Artifact download has finished successfully 2025-12-04T09:25:38.4211173Z ##[group]Run mkdir -p .additional_ci_files 2025-12-04T09:25:38.4211555Z mkdir -p .additional_ci_files 2025-12-04T09:25:38.4211971Z mv td_results.json .additional_ci_files/td_results.json || true 2025-12-04T09:25:38.4221253Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:38.4221601Z env: 2025-12-04T09:25:38.4232169Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:38.4232495Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:38.4232812Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:38.4233167Z ##[endgroup] 2025-12-04T09:25:38.4341324Z ##[group]Run .github/scripts/parse_ref.py 2025-12-04T09:25:38.4341694Z .github/scripts/parse_ref.py 2025-12-04T09:25:38.4350661Z shell: /usr/bin/bash -e {0} 2025-12-04T09:25:38.4350910Z env: 2025-12-04T09:25:38.4351114Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:38.4351371Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:38.4351666Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:38.4352017Z ##[endgroup] 2025-12-04T09:25:38.4583483Z Setting output branch=main 2025-12-04T09:25:38.4706971Z Prepare all required actions 2025-12-04T09:25:38.4707341Z Getting action download info 2025-12-04T09:25:38.6256684Z ##[group]Run ./.github/actions/filter-test-configs 2025-12-04T09:25:38.6257019Z with: 2025-12-04T09:25:38.6257377Z github-token: *** 
2025-12-04T09:25:38.6265542Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": 
"linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:25:38.6274171Z job-name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:38.6274855Z env: 2025-12-04T09:25:38.6275067Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:38.6275325Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:38.6275620Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:38.6275963Z ##[endgroup] 2025-12-04T09:25:38.6324666Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:25:38.6325009Z with: 2025-12-04T09:25:38.6325214Z shell: bash 2025-12-04T09:25:38.6325431Z timeout_minutes: 10 2025-12-04T09:25:38.6325672Z max_attempts: 5 2025-12-04T09:25:38.6325909Z retry_wait_seconds: 30 2025-12-04T09:25:38.6326660Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:25:38.6327637Z polling_interval_seconds: 1 2025-12-04T09:25:38.6327917Z warning_on_retry: true 2025-12-04T09:25:38.6328166Z continue_on_error: false 2025-12-04T09:25:38.6328397Z env: 2025-12-04T09:25:38.6328597Z 
GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:38.6328845Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:38.6329138Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:38.6329611Z GITHUB_TOKEN: *** 2025-12-04T09:25:38.6329831Z ##[endgroup] 2025-12-04T09:25:38.7356360Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:25:38.9733788Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:25:39.1002678Z Collecting requests==2.27.1 2025-12-04T09:25:39.1165689Z Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB) 2025-12-04T09:25:39.3160020Z Collecting pyyaml==6.0.2 2025-12-04T09:25:39.3196459Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (737 kB) 2025-12-04T09:25:39.7828817Z Collecting charset-normalizer~=2.0.0 2025-12-04T09:25:39.7863223Z Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB) 2025-12-04T09:25:39.7919438Z Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3.9/site-packages (from requests==2.27.1) (2.10) 2025-12-04T09:25:39.7923906Z Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3.9/site-packages (from requests==2.27.1) (1.25.10) 2025-12-04T09:25:39.8440536Z Collecting certifi>=2017.4.17 2025-12-04T09:25:39.8473485Z Downloading certifi-2025.11.12-py3-none-any.whl (159 kB) 2025-12-04T09:25:39.9356202Z Installing collected packages: charset-normalizer, certifi, requests, pyyaml 2025-12-04T09:25:40.0566763Z Successfully installed certifi-2025.11.12 charset-normalizer-2.0.12 pyyaml-6.0.2 requests-2.27.1 2025-12-04T09:25:40.7113586Z Command completed after 1 attempt(s). 
2025-12-04T09:25:40.7182833Z ##[group]Run set -x 2025-12-04T09:25:40.7183261Z set -x 2025-12-04T09:25:40.7183496Z  2025-12-04T09:25:40.7183915Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:25:40.7184428Z # in runner workspace 2025-12-04T09:25:40.7184830Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2025-12-04T09:25:40.7193914Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:40.7194274Z env: 2025-12-04T09:25:40.7194479Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.7194740Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.7195041Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.7195373Z ##[endgroup] 2025-12-04T09:25:40.7226781Z + python3 /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2025-12-04T09:25:40.7411154Z Setting output branch=main 2025-12-04T09:25:40.7467317Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:25:40.7467735Z echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:25:40.7468061Z echo "Job name: ${JOB_NAME}" 2025-12-04T09:25:40.7468348Z  2025-12-04T09:25:40.7468739Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:25:40.7469268Z # in runner workspace 2025-12-04T09:25:40.7469684Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2025-12-04T09:25:40.7470134Z  --workflow "${GITHUB_WORKFLOW}" \ 2025-12-04T09:25:40.7470455Z  --job-name "${JOB_NAME}" \ 2025-12-04T09:25:40.7478557Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": 
"linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, 
{"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}]}" \ 2025-12-04T09:25:40.7487041Z  --selected-test-configs "" \ 2025-12-04T09:25:40.7487363Z  --pr-number "${PR_NUMBER}" \ 2025-12-04T09:25:40.7487655Z  --tag "${TAG}" \ 2025-12-04T09:25:40.7487933Z  --event-name "${EVENT_NAME}" \ 2025-12-04T09:25:40.7488239Z  --schedule "${SCHEDULE}" \ 2025-12-04T09:25:40.7488530Z  --branch "${HEAD_BRANCH}" 2025-12-04T09:25:40.7497573Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:40.7497941Z env: 2025-12-04T09:25:40.7498149Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.7498409Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.7498797Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.7499351Z GITHUB_TOKEN: *** 2025-12-04T09:25:40.7500001Z JOB_NAME: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:40.7501120Z PR_NUMBER: 2025-12-04T09:25:40.7501346Z TAG: 2025-12-04T09:25:40.7501563Z EVENT_NAME: schedule 2025-12-04T09:25:40.7501811Z SCHEDULE: 29 8 * * * 2025-12-04T09:25:40.7502055Z HEAD_BRANCH: main 2025-12-04T09:25:40.7502282Z ##[endgroup] 2025-12-04T09:25:40.7531210Z Workflow: periodic 2025-12-04T09:25:40.7532038Z Job name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:40.9448767Z Setting output keep-going=True 2025-12-04T09:25:40.9449276Z Setting output ci-verbose-test-logs=False 2025-12-04T09:25:40.9449766Z Setting output ci-test-showlocals=False 2025-12-04T09:25:40.9450229Z Setting output ci-no-test-timeout=False 
2025-12-04T09:25:40.9450708Z Setting output ci-no-td=False 2025-12-04T09:25:40.9451419Z Setting output ci-td-distributed=False 2025-12-04T09:25:40.9451875Z Setting output is-unstable=False 2025-12-04T09:25:40.9452301Z Setting output reenabled-issues= 2025-12-04T09:25:40.9480597Z Setting output test-matrix={"include": [{"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], 
"mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, 
{"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": 
["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:25:40.9509858Z Setting output is-test-matrix-empty=False 2025-12-04T09:25:40.9589371Z ##[group]Run echo "Filtered matrix:" 2025-12-04T09:25:40.9589731Z echo "Filtered matrix:" 2025-12-04T09:25:40.9608007Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": 
["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": 
"default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 
8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 8, "num_shards": 8, "runner": "linux.g5.4xlarge.nvidia.gpu", "owners": ["module:slowgradcheck"], "rerun_disabled_tests": "rerun_disabled_tests"}]}" 2025-12-04T09:25:40.9625769Z  2025-12-04T09:25:40.9625979Z echo 2025-12-04T09:25:40.9626246Z echo "Is the current job unstable? False" 2025-12-04T09:25:40.9626556Z  2025-12-04T09:25:40.9626757Z echo 2025-12-04T09:25:40.9627007Z echo "Is keep-going label set? True" 2025-12-04T09:25:40.9627300Z  2025-12-04T09:25:40.9627499Z echo 2025-12-04T09:25:40.9627734Z echo "Reenabled issues? 
" 2025-12-04T09:25:40.9636973Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:40.9637343Z env: 2025-12-04T09:25:40.9637558Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.9637811Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.9638118Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.9638466Z ##[endgroup] 2025-12-04T09:25:40.9667536Z Filtered matrix: 2025-12-04T09:25:40.9690037Z {include: [{config: default, shard: 1, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 1, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 8, runner: 
linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 6, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: 
default, shard: 6, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 6, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 6, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 7, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 7, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 7, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 7, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 8, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check}, {config: default, shard: 8, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 8, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 8, num_shards: 8, runner: linux.g5.4xlarge.nvidia.gpu, owners: [module:slowgradcheck], rerun_disabled_tests: rerun_disabled_tests}]} 2025-12-04T09:25:40.9708383Z 2025-12-04T09:25:40.9708497Z Is the current job unstable? 
False 2025-12-04T09:25:40.9708726Z 2025-12-04T09:25:40.9708854Z Is keep-going label set? True 2025-12-04T09:25:40.9709054Z 2025-12-04T09:25:40.9709146Z Reenabled issues? 2025-12-04T09:25:40.9751644Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:25:40.9752136Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:25:40.9760551Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:40.9760906Z env: 2025-12-04T09:25:40.9761112Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.9761366Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.9761659Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.9762001Z JOB_TIMEOUT: 300 2025-12-04T09:25:40.9762223Z ##[endgroup] 2025-12-04T09:25:40.9818845Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:25:40.9819336Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:25:40.9819770Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:25:40.9827943Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:25:40.9828308Z env: 2025-12-04T09:25:40.9828521Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.9828780Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.9829077Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.9829422Z ##[endgroup] 2025-12-04T09:25:40.9941646Z ##[group]Run set -x 2025-12-04T09:25:40.9941959Z set -x 2025-12-04T09:25:40.9942173Z  2025-12-04T09:25:40.9942416Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2025-12-04T09:25:40.9942790Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2025-12-04T09:25:40.9943167Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2025-12-04T09:25:40.9943515Z  TEST_COMMAND=.ci/onnx/test.sh 2025-12-04T09:25:40.9943792Z else 2025-12-04T09:25:40.9944039Z  TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:25:40.9944331Z fi 2025-12-04T09:25:40.9944526Z  2025-12-04T09:25:40.9944781Z # Leaving 1GB for the runner and other 
things 2025-12-04T09:25:40.9945334Z TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo) 2025-12-04T09:25:40.9946141Z # https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap 2025-12-04T09:25:40.9946808Z # comes from https://github.com/pytorch/test-infra/pull/6058 2025-12-04T09:25:40.9947313Z TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3)) 2025-12-04T09:25:40.9947706Z  2025-12-04T09:25:40.9947955Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:25:40.9948283Z  SHM_OPTS= 2025-12-04T09:25:40.9948523Z  JENKINS_USER= 2025-12-04T09:25:40.9948901Z  # ensure that docker container cleanly exits in 12 hours 2025-12-04T09:25:40.9949354Z  # if for some reason cleanup action doesn't stop container 2025-12-04T09:25:40.9949743Z  # when job is cancelled 2025-12-04T09:25:40.9950207Z  DOCKER_SHELL_CMD="sleep 12h" 2025-12-04T09:25:40.9950523Z  USED_IMAGE="${DOCKER_IMAGE_S390X}" 2025-12-04T09:25:40.9950822Z else 2025-12-04T09:25:40.9951071Z  SHM_OPTS="--shm-size=${SHM_SIZE}" 2025-12-04T09:25:40.9951390Z  JENKINS_USER="--user jenkins" 2025-12-04T09:25:40.9951696Z  DOCKER_SHELL_CMD= 2025-12-04T09:25:40.9951980Z  USED_IMAGE="${DOCKER_IMAGE}" 2025-12-04T09:25:40.9952257Z fi 2025-12-04T09:25:40.9952461Z  2025-12-04T09:25:40.9952779Z # detached container should get cleaned up by teardown_ec2_linux 2025-12-04T09:25:40.9953282Z # TODO: Stop building test binaries as part of the build phase 2025-12-04T09:25:40.9953845Z # Used for GPU_FLAG, SHM_OPTS, JENKINS_USER and DOCKER_SHELL_CMD since that doesn't play nice 2025-12-04T09:25:40.9954361Z # shellcheck disable=SC2086,SC2090 2025-12-04T09:25:40.9954684Z container_name=$(docker run \ 2025-12-04T09:25:40.9954982Z  ${GPU_FLAG:-} \ 2025-12-04T09:25:40.9955273Z  ${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \ 2025-12-04T09:25:40.9955604Z  -e BUILD_ENVIRONMENT \ 2025-12-04T09:25:40.9955886Z  -e PR_NUMBER \ 
2025-12-04T09:25:40.9956148Z  -e GITHUB_ACTIONS \ 2025-12-04T09:25:40.9956430Z  -e GITHUB_REPOSITORY \ 2025-12-04T09:25:40.9956716Z  -e GITHUB_WORKFLOW \ 2025-12-04T09:25:40.9956991Z  -e GITHUB_JOB \ 2025-12-04T09:25:40.9957249Z  -e GITHUB_RUN_ID \ 2025-12-04T09:25:40.9957519Z  -e GITHUB_RUN_NUMBER \ 2025-12-04T09:25:40.9957794Z  -e GITHUB_RUN_ATTEMPT \ 2025-12-04T09:25:40.9958078Z  -e JOB_ID \ 2025-12-04T09:25:40.9958324Z  -e JOB_NAME \ 2025-12-04T09:25:40.9958577Z  -e BASE_SHA \ 2025-12-04T09:25:40.9958839Z  -e BRANCH \ 2025-12-04T09:25:40.9959098Z  -e SHA1 \ 2025-12-04T09:25:40.9959339Z  -e AWS_DEFAULT_REGION \ 2025-12-04T09:25:40.9959739Z  -e IN_WHEEL_TEST \ 2025-12-04T09:25:40.9960001Z  -e SHARD_NUMBER \ 2025-12-04T09:25:40.9960261Z  -e TEST_CONFIG \ 2025-12-04T09:25:40.9960514Z  -e NUM_TEST_SHARDS \ 2025-12-04T09:25:40.9960908Z  -e REENABLED_ISSUES \ 2025-12-04T09:25:40.9961198Z  -e CONTINUE_THROUGH_ERROR \ 2025-12-04T09:25:40.9961488Z  -e VERBOSE_TEST_LOGS \ 2025-12-04T09:25:40.9961770Z  -e TEST_SHOWLOCALS \ 2025-12-04T09:25:40.9962043Z  -e NO_TEST_TIMEOUT \ 2025-12-04T09:25:40.9962308Z  -e NO_TD \ 2025-12-04T09:25:40.9962544Z  -e TD_DISTRIBUTED \ 2025-12-04T09:25:40.9962813Z  -e PR_LABELS \ 2025-12-04T09:25:40.9963091Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2025-12-04T09:25:40.9963400Z  -e SCCACHE_BUCKET \ 2025-12-04T09:25:40.9963670Z  -e SCCACHE_REGION \ 2025-12-04T09:25:40.9963933Z  -e XLA_CUDA \ 2025-12-04T09:25:40.9964213Z  -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ 2025-12-04T09:25:40.9964557Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2025-12-04T09:25:40.9964909Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2025-12-04T09:25:40.9965266Z  -e SKIP_SCCACHE_INITIALIZATION=1 \ 2025-12-04T09:25:40.9965585Z  -e HUGGING_FACE_HUB_TOKEN \ 2025-12-04T09:25:40.9965906Z  -e VLLM_TEST_HUGGING_FACE_TOKEN \ 2025-12-04T09:25:40.9966234Z  -e SCRIBE_GRAPHQL_ACCESS_TOKEN \ 2025-12-04T09:25:40.9966535Z  -e DASHBOARD_TAG \ 2025-12-04T09:25:40.9966811Z  -e ARTIFACTS_FILE_SUFFIX \ 
2025-12-04T09:25:40.9967156Z  --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \ 2025-12-04T09:25:40.9967547Z  --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \ 2025-12-04T09:25:40.9967938Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T09:25:40.9968321Z  --security-opt seccomp=unconfined \ 2025-12-04T09:25:40.9968748Z  --cap-add=SYS_PTRACE \ 2025-12-04T09:25:40.9969060Z  --ipc=host \ 2025-12-04T09:25:40.9969308Z  ${SHM_OPTS} \ 2025-12-04T09:25:40.9969550Z  --tty \ 2025-12-04T09:25:40.9969772Z  --detach \ 2025-12-04T09:25:40.9970030Z  --name="${container_name}" \ 2025-12-04T09:25:40.9970325Z  ${JENKINS_USER} \ 2025-12-04T09:25:40.9970647Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2025-12-04T09:25:40.9971023Z  -w /var/lib/jenkins/workspace \ 2025-12-04T09:25:40.9971321Z  "${USED_IMAGE}" \ 2025-12-04T09:25:40.9971584Z  ${DOCKER_SHELL_CMD} 2025-12-04T09:25:40.9971829Z ) 2025-12-04T09:25:40.9972147Z echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" 2025-12-04T09:25:40.9972532Z  2025-12-04T09:25:40.9972776Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:25:40.9973322Z  docker exec -t "${container_name}" sh -c "python3 -m pip install -r .ci/docker/requirements-ci.txt" 2025-12-04T09:25:40.9973818Z fi 2025-12-04T09:25:40.9974020Z  2025-12-04T09:25:40.9974476Z docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" 2025-12-04T09:25:40.9983168Z shell: /usr/bin/bash -e {0} 2025-12-04T09:25:40.9983420Z env: 2025-12-04T09:25:40.9983621Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:25:40.9983874Z HAS_NVIDIA_GPU: true 2025-12-04T09:25:40.9984174Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:25:40.9984631Z BUILD_ENVIRONMENT: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:25:40.9985023Z PR_NUMBER: 2025-12-04T09:25:40.9985258Z GITHUB_REPOSITORY: pytorch/pytorch 2025-12-04T09:25:40.9985555Z GITHUB_WORKFLOW: periodic 2025-12-04T09:25:40.9985800Z 
GITHUB_JOB: test 2025-12-04T09:25:40.9986028Z GITHUB_RUN_ID: 19922826259 2025-12-04T09:25:40.9986285Z GITHUB_RUN_NUMBER: 19107 2025-12-04T09:25:40.9986535Z GITHUB_RUN_ATTEMPT: 1 2025-12-04T09:25:40.9986771Z JOB_ID: 57118183207 2025-12-04T09:25:40.9987537Z JOB_NAME: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:25:40.9988231Z BRANCH: main 2025-12-04T09:25:40.9988490Z SHA1: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:25:40.9988859Z BASE_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:25:40.9989187Z TEST_CONFIG: default 2025-12-04T09:25:40.9989412Z SHARD_NUMBER: 5 2025-12-04T09:25:40.9989630Z NUM_TEST_SHARDS: 8 2025-12-04T09:25:40.9989853Z EXTRA_FLAGS: 2025-12-04T09:25:40.9990065Z OP_BENCHMARK_TESTS: 2025-12-04T09:25:40.9990302Z REENABLED_ISSUES: 2025-12-04T09:25:40.9990543Z CONTINUE_THROUGH_ERROR: True 2025-12-04T09:25:40.9990803Z VERBOSE_TEST_LOGS: False 2025-12-04T09:25:40.9991060Z TEST_SHOWLOCALS: False 2025-12-04T09:25:40.9991307Z NO_TEST_TIMEOUT: False 2025-12-04T09:25:40.9991545Z NO_TD: False 2025-12-04T09:25:40.9991765Z TD_DISTRIBUTED: False 2025-12-04T09:25:40.9992065Z SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 2025-12-04T09:25:40.9992399Z SCCACHE_REGION: us-east-1 2025-12-04T09:25:40.9992645Z SHM_SIZE: 2g 2025-12-04T09:25:40.9993385Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:25:40.9994687Z DOCKER_IMAGE_S390X: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:25:40.9995477Z XLA_CUDA: 2025-12-04T09:25:40.9995814Z XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:25:40.9996244Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2025-12-04T09:25:40.9996551Z 
PYTORCH_TEST_RERUN_DISABLED_TESTS: 1 2025-12-04T09:25:40.9996828Z DASHBOARD_TAG: 2025-12-04T09:25:40.9997318Z VLLM_TEST_HUGGING_FACE_TOKEN: *** 2025-12-04T09:25:40.9997701Z HUGGING_FACE_HUB_TOKEN: *** 2025-12-04T09:25:40.9998087Z SCRIBE_GRAPHQL_ACCESS_TOKEN: *** 2025-12-04T09:25:40.9998518Z ARTIFACTS_FILE_SUFFIX: test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T09:25:40.9998993Z ##[endgroup] 2025-12-04T09:25:41.0029345Z + [[ default == \m\u\l\t\i\g\p\u ]] 2025-12-04T09:25:41.0029751Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *onnx* ]] 2025-12-04T09:25:41.0030171Z + TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:25:41.0033608Z ++ awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo 2025-12-04T09:25:41.0059674Z + TOTAL_AVAILABLE_MEMORY_IN_GB='61.094 ' 2025-12-04T09:25:41.0060479Z + TOTAL_MEMORY_WITH_SWAP=64 2025-12-04T09:25:41.0061312Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *\s\3\9\0\x* ]] 2025-12-04T09:25:41.0062124Z + SHM_OPTS=--shm-size=2g 2025-12-04T09:25:41.0062639Z + JENKINS_USER='--user jenkins' 2025-12-04T09:25:41.0063151Z + DOCKER_SHELL_CMD= 2025-12-04T09:25:41.0075982Z + USED_IMAGE=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:25:41.0076845Z +++ nproc --ignore=2 2025-12-04T09:25:41.0101559Z ++ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e TEST_SHOWLOCALS -e NO_TEST_TIMEOUT -e NO_TD -e TD_DISTRIBUTED -e PR_LABELS -e MAX_JOBS=14 -e SCCACHE_BUCKET -e SCCACHE_REGION -e XLA_CUDA -e 
XLA_CLANG_CACHE_S3_BUCKET_NAME -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e SKIP_SCCACHE_INITIALIZATION=1 -e HUGGING_FACE_HUB_TOKEN -e VLLM_TEST_HUGGING_FACE_TOKEN -e SCRIBE_GRAPHQL_ACCESS_TOKEN -e DASHBOARD_TAG -e ARTIFACTS_FILE_SUFFIX --memory=61g --memory-swap=64g --env-file=/tmp/github_env_19922826259 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --shm-size=2g --tty --detach --name= --user jenkins -v /home/ec2-user/actions-runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:25:49.7235772Z + container_name=2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T09:25:49.7237280Z + echo DOCKER_CONTAINER_ID=2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T09:25:49.7237899Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *\s\3\9\0\x* ]] 2025-12-04T09:25:49.7242223Z ++ echo dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:25:49.7245700Z + docker exec -t 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea sh -c 'python3 -m pip install dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl[opt-einsum] && .ci/pytorch/test.sh' 2025-12-04T09:25:50.2313709Z Processing ./dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl (from torch==2.10.0a0+gitffd9b0f) 2025-12-04T09:25:50.5538122Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.18.0) 2025-12-04T09:25:50.5542246Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (4.12.2) 2025-12-04T09:25:50.5547055Z Requirement already satisfied: 
sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.13.3) 2025-12-04T09:25:50.5551986Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2.8.8) 2025-12-04T09:25:50.5555823Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.1.6) 2025-12-04T09:25:50.5560785Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2025.10.0) 2025-12-04T09:25:50.5574449Z Requirement already satisfied: opt-einsum>=3.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.3.0) 2025-12-04T09:25:50.5959962Z Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from opt-einsum>=3.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.22.4) 2025-12-04T09:25:50.5978645Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.3.0) 2025-12-04T09:25:50.6037510Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.0.3) 2025-12-04T09:25:50.9819951Z Installing collected packages: torch 2025-12-04T09:26:02.3298109Z Successfully installed torch-2.10.0a0+gitffd9b0f 2025-12-04T09:26:02.4038012Z + export TERM=vt100 2025-12-04T09:26:02.4038287Z + TERM=vt100 2025-12-04T09:26:02.4040268Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:26:02.4052260Z + source .ci/pytorch/common.sh 2025-12-04T09:26:02.4056289Z +++ dirname 
.ci/pytorch/common.sh 2025-12-04T09:26:02.4065117Z ++ source .ci/pytorch/common_utils.sh 2025-12-04T09:26:02.4066194Z +++ declare -f -t trap_add 2025-12-04T09:26:02.4071487Z ++ set -ex -o pipefail 2025-12-04T09:26:02.4071861Z ++ [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *rocm* ]] 2025-12-04T09:26:02.4072280Z ++ BUILD_TEST_LIBTORCH=0 2025-12-04T09:26:02.4075252Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:26:02.4083965Z + source .ci/pytorch/common-build.sh 2025-12-04T09:26:02.4085728Z ++ [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *win-* ]] 2025-12-04T09:26:02.4091262Z ++++ dirname .ci/pytorch/common-build.sh 2025-12-04T09:26:02.4100580Z +++ cd .ci/pytorch 2025-12-04T09:26:02.4100823Z +++ pwd -P 2025-12-04T09:26:02.4103766Z ++ script_dir=/var/lib/jenkins/workspace/.ci/pytorch 2025-12-04T09:26:02.4104238Z ++ [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *-pch* ]] 2025-12-04T09:26:02.4104623Z ++ which sccache 2025-12-04T09:26:02.4177720Z ++ [[ -z ossci-compiler-cache-circleci-v2 ]] 2025-12-04T09:26:02.4178088Z ++ sccache --stop-server 2025-12-04T09:26:02.4208267Z ++ true 2025-12-04T09:26:02.4208520Z ++ rm -f /var/lib/jenkins/sccache_error.log 2025-12-04T09:26:02.4220176Z ++ trap_add sccache_epilogue EXIT 2025-12-04T09:26:02.4220560Z ++ trap_add_cmd=sccache_epilogue 2025-12-04T09:26:02.4220879Z ++ shift 2025-12-04T09:26:02.4221257Z ++ for trap_add_name in "$@" 2025-12-04T09:26:02.4227046Z ++++ trap -p EXIT 2025-12-04T09:26:02.4230472Z +++ eval 'extract_trap_cmd ' 2025-12-04T09:26:02.4230756Z ++++ extract_trap_cmd 2025-12-04T09:26:02.4230997Z ++++ printf '%s\n' '' 2025-12-04T09:26:02.4231256Z +++ printf '%s\n' sccache_epilogue 2025-12-04T09:26:02.4234138Z ++ trap -- ' 2025-12-04T09:26:02.4234400Z sccache_epilogue' EXIT 2025-12-04T09:26:02.4234642Z ++ [[ -n 1 ]] 2025-12-04T09:26:02.4235013Z ++ echo 'Skipping sccache server initialization, setting environment variables' 2025-12-04T09:26:02.4235577Z Skipping sccache server initialization, setting 
environment variables 2025-12-04T09:26:02.4236007Z ++ export SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:26:02.4236283Z ++ SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:26:02.4236616Z ++ export SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:26:02.4237042Z ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:26:02.4242471Z ++ export RUST_LOG=sccache::server=error 2025-12-04T09:26:02.4242798Z ++ RUST_LOG=sccache::server=error 2025-12-04T09:26:02.4243090Z ++ sccache --zero-stats 2025-12-04T09:26:02.7823572Z Statistics zeroed. 2025-12-04T09:26:02.7832693Z ++ which ccache 2025-12-04T09:26:02.7891814Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *rocm* ]] 2025-12-04T09:26:02.7892314Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *s390x* ]] 2025-12-04T09:26:02.7892720Z + [[ -d /var/lib/jenkins/workspace ]] 2025-12-04T09:26:02.7894968Z ++ stat -c %u /var/lib/jenkins/workspace 2025-12-04T09:26:02.7912861Z + WORKSPACE_ORIGINAL_OWNER_ID=1000 2025-12-04T09:26:02.7913176Z + trap_add cleanup_workspace EXIT 2025-12-04T09:26:02.7913473Z + trap_add_cmd=cleanup_workspace 2025-12-04T09:26:02.7913742Z + shift 2025-12-04T09:26:02.7913960Z + for trap_add_name in "$@" 2025-12-04T09:26:02.7920943Z +++ trap -p EXIT 2025-12-04T09:26:02.7924843Z ++ eval 'extract_trap_cmd trap -- '\'' 2025-12-04T09:26:02.7925293Z sccache_epilogue'\'' EXIT' 2025-12-04T09:26:02.7925668Z +++ extract_trap_cmd trap -- ' 2025-12-04T09:26:02.7926041Z sccache_epilogue' EXIT 2025-12-04T09:26:02.7926380Z +++ printf '%s\n' ' 2025-12-04T09:26:02.7926678Z sccache_epilogue' 2025-12-04T09:26:02.7926925Z ++ printf '%s\n' cleanup_workspace 2025-12-04T09:26:02.7929717Z + trap -- ' 2025-12-04T09:26:02.7929929Z sccache_epilogue 2025-12-04T09:26:02.7930147Z cleanup_workspace' EXIT 2025-12-04T09:26:02.7930435Z + sudo chown -R jenkins /var/lib/jenkins/workspace 2025-12-04T09:26:03.8248774Z + git config --global --add safe.directory /var/lib/jenkins/workspace 2025-12-04T09:26:03.8272325Z + [[ 
linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *cuda* ]] 2025-12-04T09:26:03.8275838Z ++ python -c 'import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))' 2025-12-04T09:26:04.2570683Z + NUMBA_CUDA_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:26:04.2571491Z + '[' -n /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ']' 2025-12-04T09:26:04.2575845Z +++ realpath .ci/pytorch/test.sh 2025-12-04T09:26:04.2588425Z ++ dirname /var/lib/jenkins/workspace/.ci/pytorch/test.sh 2025-12-04T09:26:04.2676727Z + NUMBA_PATCH=/var/lib/jenkins/workspace/.ci/pytorch/numba-cuda-13.patch 2025-12-04T09:26:04.2677658Z + pushd /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:26:04.2678202Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ~/workspace 2025-12-04T09:26:04.2678628Z + patch -p4 2025-12-04T09:26:04.2692512Z patching file cudadrv/driver.py 2025-12-04T09:26:04.2692831Z Hunk #1 succeeded at 357 (offset -8 lines). 
2025-12-04T09:26:04.2752242Z + popd 2025-12-04T09:26:04.2752458Z ~/workspace 2025-12-04T09:26:04.2752674Z + echo 'Environment variables:' 2025-12-04T09:26:04.2752952Z Environment variables: 2025-12-04T09:26:04.2753187Z + env 2025-12-04T09:26:04.2762945Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:26:04.2763531Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:26:04.2764079Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:26:04.2764824Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:26:04.2765269Z HOSTNAME=2719baa28228 2025-12-04T09:26:04.2766019Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.2766625Z GITHUB_ACTION=__run_3 2025-12-04T09:26:04.2766889Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:26:04.2767179Z GITHUB_RUN_NUMBER=19107 2025-12-04T09:26:04.2767423Z TEST_CONFIG=default 2025-12-04T09:26:04.2767664Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:26:04.2767978Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:26:04.2768288Z SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:26:04.2768653Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:26:04.2768941Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:26:04.2769208Z GITHUB_REF_TYPE=branch 2025-12-04T09:26:04.2769489Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.2770262Z XLA_CUDA= 2025-12-04T09:26:04.2770568Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:26:04.2771113Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:26:04.2771713Z *** 2025-12-04T09:26:04.2771929Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:26:04.2772206Z GITHUB_ACTIONS=true 2025-12-04T09:26:04.2772452Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:26:04.2772813Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:26:04.2773193Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.2773560Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 
2025-12-04T09:26:04.2774063Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/periodic.yml@refs/heads/main 2025-12-04T09:26:04.2774515Z UCC_HOME=/usr 2025-12-04T09:26:04.2774735Z VERBOSE_TEST_LOGS=False 2025-12-04T09:26:04.2774989Z GITHUB_REF=refs/heads/main 2025-12-04T09:26:04.2775238Z SHARD_NUMBER=5 2025-12-04T09:26:04.2775459Z GITHUB_REF_PROTECTED=true 2025-12-04T09:26:04.2775704Z HOME=/var/lib/jenkins 2025-12-04T09:26:04.2775963Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:26:04.2776287Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:26:04.2776618Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:26:04.2776937Z USE_SYSTEM_NCCL=1 2025-12-04T09:26:04.2777158Z NUM_TEST_SHARDS=8 2025-12-04T09:26:04.2777379Z UCX_HOME=/usr 2025-12-04T09:26:04.2777909Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.2778933Z JOB_NAME=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:26:04.2779928Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.2780685Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:26:04.2781157Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:26:04.2781407Z DASHBOARD_TAG= 2025-12-04T09:26:04.2781642Z GITHUB_RUN_ID=19922826259 2025-12-04T09:26:04.2781890Z INSTALLED_OPENBLAS= 2025-12-04T09:26:04.2782460Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.2783271Z GITHUB_ACTOR=huydhn 2025-12-04T09:26:04.2783502Z PR_NUMBER= 2025-12-04T09:26:04.2783705Z DESIRED_CUDA=12.8.1 2025-12-04T09:26:04.2783936Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:26:04.2784186Z ANACONDA_PYTHON_VERSION=3.10 
2025-12-04T09:26:04.2784504Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:26:04.2784840Z TERM=vt100 2025-12-04T09:26:04.2785096Z INSTALLED_VISION=yes 2025-12-04T09:26:04.2785330Z BRANCH=main 2025-12-04T09:26:04.2785549Z SCCACHE_REGION=us-east-1 2025-12-04T09:26:04.2785808Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:26:04.2786072Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:26:04.2786323Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:26:04.2786816Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:26:04.2787364Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:26:04.2787705Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:26:04.2788032Z REENABLED_ISSUES= 2025-12-04T09:26:04.2788256Z DOCS= 2025-12-04T09:26:04.2788450Z SHLVL=1 2025-12-04T09:26:04.2788647Z MAX_JOBS=14 2025-12-04T09:26:04.2788863Z GITHUB_ACTOR_ID=475357 2025-12-04T09:26:04.2789189Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.2789560Z GITHUB_REF_NAME=main 2025-12-04T09:26:04.2789923Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:26:04.2790323Z GITHUB_JOB=test 2025-12-04T09:26:04.2790549Z NO_TEST_TIMEOUT=False 2025-12-04T09:26:04.2790790Z TD_DISTRIBUTED=False 2025-12-04T09:26:04.2791038Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:26:04.2791327Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:26:04.2791590Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:26:04.2791842Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:26:04.2792668Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:26:04.2793413Z GITHUB_BASE_REF= 2025-12-04T09:26:04.2793637Z INSTALLED_ACL= 2025-12-04T09:26:04.2794021Z ARTIFACTS_FILE_SUFFIX=test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T09:26:04.2794461Z CI=true 2025-12-04T09:26:04.2794678Z 
GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:26:04.2794988Z RUST_LOG=sccache::server=error 2025-12-04T09:26:04.2795249Z JOB_ID=57118183207 2025-12-04T09:26:04.2795477Z GITHUB_HEAD_REF= 2025-12-04T09:26:04.2795694Z GITHUB_ACTION_REF= 2025-12-04T09:26:04.2795975Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:26:04.2796312Z TEST_SHOWLOCALS=False 2025-12-04T09:26:04.2796554Z GITHUB_WORKFLOW=periodic 2025-12-04T09:26:04.2796835Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:26:04.2797431Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.2798036Z NO_TD=False 2025-12-04T09:26:04.2798268Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:26:04.2798560Z NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:26:04.2798864Z _=/usr/bin/env 2025-12-04T09:26:04.2799208Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:26:04.2799803Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2025-12-04T09:26:04.2910908Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2025-12-04T09:26:04.2911481Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:26:04.2912039Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2025-12-04T09:26:04.2912597Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2025-12-04T09:26:04.2913019Z + BUILD_DIR=build 2025-12-04T09:26:04.2913247Z + BUILD_RENAMED_DIR=build_renamed 2025-12-04T09:26:04.2913539Z + BUILD_BIN_DIR=build/bin 2025-12-04T09:26:04.2913780Z + SHARD_NUMBER=5 2025-12-04T09:26:04.2913995Z + NUM_TEST_SHARDS=8 2025-12-04T09:26:04.2914239Z + export TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:26:04.2914532Z + TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:26:04.2915006Z + export VALGRIND=ON 2025-12-04T09:26:04.2915259Z + VALGRIND=ON 2025-12-04T09:26:04.2915591Z + [[ 
linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *clang9* ]] 2025-12-04T09:26:04.2916073Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *xpu* ]] 2025-12-04T09:26:04.2916445Z + detect_cuda_arch 2025-12-04T09:26:04.2916762Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *cuda* ]] 2025-12-04T09:26:04.2917150Z + command -v nvidia-smi 2025-12-04T09:26:04.2917385Z /usr/bin/nvidia-smi 2025-12-04T09:26:04.2923239Z ++ nvidia-smi --query-gpu=compute_cap --format=csv 2025-12-04T09:26:04.2924287Z ++ tail -n 1 2025-12-04T09:26:04.3268730Z + TORCH_CUDA_ARCH_LIST=8.6 2025-12-04T09:26:04.3269072Z + export TORCH_CUDA_ARCH_LIST 2025-12-04T09:26:04.3269477Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *s390x* ]] 2025-12-04T09:26:04.3269860Z + [[ 1 == \1 ]] 2025-12-04T09:26:04.3270074Z + ulimit -c 0 2025-12-04T09:26:04.3270390Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *bazel* ]] 2025-12-04T09:26:04.3273313Z ++ realpath build/custom_test_artifacts 2025-12-04T09:26:04.3492845Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/workspace/build/custom_test_artifacts 2025-12-04T09:26:04.3493472Z + [[ -n '' ]] 2025-12-04T09:26:04.3493785Z + echo 'Environment variables' 2025-12-04T09:26:04.3494164Z Environment variables 2025-12-04T09:26:04.3494390Z + env 2025-12-04T09:26:04.3659077Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:26:04.3659853Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:26:04.3660365Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck 2025-12-04T09:26:04.3661141Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:26:04.3661472Z HOSTNAME=2719baa28228 2025-12-04T09:26:04.3662427Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.3663122Z GITHUB_ACTION=__run_3 2025-12-04T09:26:04.3663482Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:26:04.3663777Z GITHUB_RUN_NUMBER=19107 2025-12-04T09:26:04.3664026Z 
TEST_CONFIG=default 2025-12-04T09:26:04.3664275Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:26:04.3664594Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:26:04.3664901Z SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:26:04.3665325Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:26:04.3665609Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:26:04.3665882Z GITHUB_REF_TYPE=branch 2025-12-04T09:26:04.3666124Z TORCH_CUDA_ARCH_LIST=8.6 2025-12-04T09:26:04.3666428Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.3666754Z XLA_CUDA= 2025-12-04T09:26:04.3666970Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:26:04.3667551Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:26:04.3667868Z *** 2025-12-04T09:26:04.3668075Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:26:04.3668334Z GITHUB_ACTIONS=true 2025-12-04T09:26:04.3668579Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:26:04.3669042Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:26:04.3669539Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.3669956Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.3670667Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/periodic.yml@refs/heads/main 2025-12-04T09:26:04.3671284Z UCC_HOME=/usr 2025-12-04T09:26:04.3671574Z TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:26:04.3671967Z VERBOSE_TEST_LOGS=False 2025-12-04T09:26:04.3672322Z GITHUB_REF=refs/heads/main 2025-12-04T09:26:04.3672637Z SHARD_NUMBER=5 2025-12-04T09:26:04.3672857Z GITHUB_REF_PROTECTED=true 2025-12-04T09:26:04.3673106Z HOME=/var/lib/jenkins 2025-12-04T09:26:04.3673366Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:26:04.3673684Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:26:04.3674019Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:26:04.3674333Z USE_SYSTEM_NCCL=1 2025-12-04T09:26:04.3674551Z NUM_TEST_SHARDS=8 2025-12-04T09:26:04.3674763Z UCX_HOME=/usr 2025-12-04T09:26:04.3675523Z 
GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.3676573Z JOB_NAME=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T09:26:04.3677570Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.3678319Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:26:04.3678793Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:26:04.3679045Z DASHBOARD_TAG= 2025-12-04T09:26:04.3679270Z GITHUB_RUN_ID=19922826259 2025-12-04T09:26:04.3679514Z INSTALLED_OPENBLAS= 2025-12-04T09:26:04.3680183Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.3680812Z GITHUB_ACTOR=huydhn 2025-12-04T09:26:04.3681036Z PR_NUMBER= 2025-12-04T09:26:04.3681241Z DESIRED_CUDA=12.8.1 2025-12-04T09:26:04.3681481Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:26:04.3681730Z VALGRIND=ON 2025-12-04T09:26:04.3681952Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:26:04.3682326Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:26:04.3682656Z TERM=vt100 2025-12-04T09:26:04.3682865Z INSTALLED_VISION=yes 2025-12-04T09:26:04.3683089Z BRANCH=main 2025-12-04T09:26:04.3683307Z SCCACHE_REGION=us-east-1 2025-12-04T09:26:04.3683577Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:26:04.3683839Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:26:04.3684098Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:26:04.3684594Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:26:04.3685232Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:26:04.3685580Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:26:04.3685906Z REENABLED_ISSUES= 
2025-12-04T09:26:04.3686123Z DOCS= 2025-12-04T09:26:04.3686307Z SHLVL=1 2025-12-04T09:26:04.3686499Z MAX_JOBS=14 2025-12-04T09:26:04.3686718Z GITHUB_ACTOR_ID=475357 2025-12-04T09:26:04.3687044Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:26:04.3687419Z GITHUB_REF_NAME=main 2025-12-04T09:26:04.3699325Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:26:04.3699746Z GITHUB_JOB=test 2025-12-04T09:26:04.3699981Z NO_TEST_TIMEOUT=False 2025-12-04T09:26:04.3700226Z TD_DISTRIBUTED=False 2025-12-04T09:26:04.3700853Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:26:04.3701158Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:26:04.3701426Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:26:04.3701688Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:26:04.3702424Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:26:04.3703182Z GITHUB_BASE_REF= 2025-12-04T09:26:04.3703400Z INSTALLED_ACL= 2025-12-04T09:26:04.3703792Z ARTIFACTS_FILE_SUFFIX=test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T09:26:04.3704228Z CI=true 2025-12-04T09:26:04.3704446Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:26:04.3704751Z RUST_LOG=sccache::server=error 2025-12-04T09:26:04.3705014Z JOB_ID=57118183207 2025-12-04T09:26:04.3705238Z GITHUB_HEAD_REF= 2025-12-04T09:26:04.3705473Z GITHUB_ACTION_REF= 2025-12-04T09:26:04.3705770Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:26:04.3706108Z TEST_SHOWLOCALS=False 2025-12-04T09:26:04.3706350Z GITHUB_WORKFLOW=periodic 2025-12-04T09:26:04.3706613Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:26:04.3707205Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_7f3107e4-6029-4f3d-8f43-ae3b2474cd54 2025-12-04T09:26:04.3707802Z NO_TD=False 2025-12-04T09:26:04.3708025Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:26:04.3708319Z 
NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:26:04.3708740Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:26:04.3709325Z _=/usr/bin/env 2025-12-04T09:26:04.3709555Z + echo 'Testing pytorch' 2025-12-04T09:26:04.3709810Z Testing pytorch 2025-12-04T09:26:04.3710033Z + export LANG=C.UTF-8 2025-12-04T09:26:04.3710264Z + LANG=C.UTF-8 2025-12-04T09:26:04.3710473Z + PR_NUMBER= 2025-12-04T09:26:04.3710688Z + [[ default == \d\e\f\a\u\l\t ]] 2025-12-04T09:26:04.3710969Z + export CUDA_VISIBLE_DEVICES=0 2025-12-04T09:26:04.3711237Z + CUDA_VISIBLE_DEVICES=0 2025-12-04T09:26:04.3711490Z + export HIP_VISIBLE_DEVICES=0 2025-12-04T09:26:04.3711758Z + HIP_VISIBLE_DEVICES=0 2025-12-04T09:26:04.3712013Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2025-12-04T09:26:04.3712298Z + [[ default == \s\l\o\w ]] 2025-12-04T09:26:04.3712688Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *slow-gradcheck* ]] 2025-12-04T09:26:04.3713154Z + export PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 2025-12-04T09:26:04.3713471Z + PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 2025-12-04T09:26:04.3713784Z + export PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 2025-12-04T09:26:04.3714111Z + PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 2025-12-04T09:26:04.3714566Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *cuda* ]] 2025-12-04T09:26:04.3715017Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:26:04.3715354Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:26:04.3715656Z + [[ default == *crossref* ]] 2025-12-04T09:26:04.3716010Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *rocm* ]] 2025-12-04T09:26:04.3716488Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *xpu* ]] 2025-12-04T09:26:04.3716978Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *-bazel-* ]] 2025-12-04T09:26:04.3717375Z + pip_install ninja==1.10.2 2025-12-04T09:26:04.3717725Z + pip_install_pkg='python3 -m pip install --progress-bar off' 2025-12-04T09:26:04.3718317Z + python3 -m pip install 
--progress-bar off ninja==1.10.2 2025-12-04T09:26:04.9431030Z Collecting ninja==1.10.2 2025-12-04T09:26:04.9671023Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2025-12-04T09:26:05.0117305Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2025-12-04T09:26:05.4193162Z Installing collected packages: ninja 2025-12-04T09:26:05.4193830Z Attempting uninstall: ninja 2025-12-04T09:26:05.4199252Z Found existing installation: ninja 1.11.1.4 2025-12-04T09:26:05.4224318Z Uninstalling ninja-1.11.1.4: 2025-12-04T09:26:05.4359444Z Successfully uninstalled ninja-1.11.1.4 2025-12-04T09:26:05.5074835Z Successfully installed ninja-1.10.2 2025-12-04T09:26:05.5676222Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:26:05.5678001Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:26:05.5678982Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *aarch64* ]] 2025-12-04T09:26:05.5679488Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *asan* ]] 2025-12-04T09:26:05.5680077Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *-debug* ]] 2025-12-04T09:26:05.5680579Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *-bazel-* ]] 2025-12-04T09:26:05.5681225Z + echo 'We are not in debug mode: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck. Expect the assertion to pass' 2025-12-04T09:26:05.5681996Z We are not in debug mode: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck. 
Expect the assertion to pass 2025-12-04T09:26:05.5682509Z + cd test 2025-12-04T09:26:05.5682834Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T09:26:07.2296050Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2025-12-04T09:26:07.2296434Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2025-12-04T09:26:07.2296794Z + [[ default == \l\e\g\a\c\y\_\n\v\i\d\i\a\_\d\r\i\v\e\r ]] 2025-12-04T09:26:07.2301131Z + DYNAMO_BENCHMARK_FLAGS=() 2025-12-04T09:26:07.2301573Z + [[ default == *pr_time_benchmarks* ]] 2025-12-04T09:26:07.2301886Z + [[ default == *dynamo_eager* ]] 2025-12-04T09:26:07.2302174Z + [[ default == *aot_eager* ]] 2025-12-04T09:26:07.2302438Z + [[ default == *aot_inductor* ]] 2025-12-04T09:26:07.2302718Z + [[ default == *max_autotune_inductor* ]] 2025-12-04T09:26:07.2303016Z + [[ default == *inductor* ]] 2025-12-04T09:26:07.2303282Z + [[ default == *dynamic* ]] 2025-12-04T09:26:07.2303527Z + [[ default == *cpu* ]] 2025-12-04T09:26:07.2303764Z + [[ default == *xpu* ]] 2025-12-04T09:26:07.2304040Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2025-12-04T09:26:07.2334390Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *libtorch* ]] 2025-12-04T09:26:07.2334931Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *-bazel-* ]] 2025-12-04T09:26:07.2337647Z + cd test 2025-12-04T09:26:07.2338081Z + python -c 'import torch; print(torch.__config__.show())' 2025-12-04T09:26:08.8905693Z PyTorch built with: 2025-12-04T09:26:08.8906096Z - GCC 11.4 2025-12-04T09:26:08.8906386Z - C++ Version: 201703 2025-12-04T09:26:08.8906987Z - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:26:08.8907653Z - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:26:08.8908085Z - OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T09:26:08.8908404Z - LAPACK is enabled (usually provided by MKL) 2025-12-04T09:26:08.8908726Z - NNPACK is enabled 2025-12-04T09:26:08.8908983Z - CPU capability usage: AVX2 2025-12-04T09:26:08.8909249Z - CUDA Runtime 12.8 2025-12-04T09:26:08.8909589Z - NVCC architecture flags: -gencode;arch=compute_86,code=sm_86 2025-12-04T09:26:08.8910335Z - CuDNN 91.0.2 (built against CUDA 12.9) 2025-12-04T09:26:08.8914776Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32, CUDA_VERSION=12.8, CUDNN_VERSION=9.10.2, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Werror -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 2025-12-04T09:26:08.8919401Z 2025-12-04T09:26:09.2693646Z + cd test 2025-12-04T09:26:09.2694038Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2025-12-04T09:26:10.5968764Z ATen/Parallel: 2025-12-04T09:26:10.5969087Z at::get_num_threads() : 8 2025-12-04T09:26:10.5969375Z at::get_num_interop_threads() : 16 2025-12-04T09:26:10.5969682Z OpenMP 
201511 (a.k.a. OpenMP 4.5) 2025-12-04T09:26:10.5969969Z omp_get_max_threads() : 8 2025-12-04T09:26:10.5970490Z Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:26:10.5971048Z mkl_get_max_threads() : 8 2025-12-04T09:26:10.5971410Z Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:26:10.5971862Z std::thread::hardware_concurrency() : 16 2025-12-04T09:26:10.5972158Z Environment variables: 2025-12-04T09:26:10.5972408Z OMP_NUM_THREADS : [not set] 2025-12-04T09:26:10.5972668Z MKL_NUM_THREADS : [not set] 2025-12-04T09:26:10.5973294Z ATen parallel backend: OpenMP 2025-12-04T09:26:10.5973485Z 2025-12-04T09:26:10.9236567Z + [[ default == *numpy_2* ]] 2025-12-04T09:26:10.9237199Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *aarch64* ]] 2025-12-04T09:26:10.9237680Z + [[ default == *backward* ]] 2025-12-04T09:26:10.9237992Z + [[ default == *libtorch_agnostic_targetting* ]] 2025-12-04T09:26:10.9238307Z + [[ default == *xla* ]] 2025-12-04T09:26:10.9238543Z + [[ default == *vllm* ]] 2025-12-04T09:26:10.9238792Z + [[ default == *executorch* ]] 2025-12-04T09:26:10.9239063Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2025-12-04T09:26:10.9239353Z + [[ default == \q\u\a\n\t\i\z\a\t\i\o\n ]] 2025-12-04T09:26:10.9239848Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *libtorch* ]] 2025-12-04T09:26:10.9240283Z + [[ default == distributed ]] 2025-12-04T09:26:10.9240559Z + [[ default == *operator_benchmark* ]] 2025-12-04T09:26:10.9240878Z + [[ default == *operator_microbenchmark* ]] 2025-12-04T09:26:10.9241221Z + [[ default == *attention_microbenchmark* ]] 2025-12-04T09:26:10.9241545Z + [[ default == *inductor_distributed* ]] 2025-12-04T09:26:10.9241842Z + [[ default == *inductor-halide* ]] 2025-12-04T09:26:10.9242141Z + [[ default == *inductor-pallas* ]] 2025-12-04T09:26:10.9242470Z + [[ default == *inductor-triton-cpu* ]] 2025-12-04T09:26:10.9242894Z + [[ 
default == *inductor-micro-benchmark* ]] 2025-12-04T09:26:10.9243338Z + [[ default == *aoti_cross_compile_for_windows* ]] 2025-12-04T09:26:10.9243670Z + [[ default == *huggingface* ]] 2025-12-04T09:26:10.9243934Z + [[ default == *timm* ]] 2025-12-04T09:26:10.9244186Z + [[ default == cachebench ]] 2025-12-04T09:26:10.9244455Z + [[ default == verify_cachebench ]] 2025-12-04T09:26:10.9244737Z + [[ default == *torchbench* ]] 2025-12-04T09:26:10.9245379Z + [[ default == *inductor_cpp_wrapper* ]] 2025-12-04T09:26:10.9245685Z + [[ default == *inductor_core* ]] 2025-12-04T09:26:10.9245966Z + [[ default == *inductor* ]] 2025-12-04T09:26:10.9246222Z + [[ default == *einops* ]] 2025-12-04T09:26:10.9246486Z + [[ default == *dynamo_core* ]] 2025-12-04T09:26:10.9246784Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:26:10.9247181Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *rocm* ]] 2025-12-04T09:26:10.9247559Z + [[ 5 == 1 ]] 2025-12-04T09:26:10.9247763Z + [[ 5 == 2 ]] 2025-12-04T09:26:10.9247964Z + [[ 5 -gt 2 ]] 2025-12-04T09:26:10.9248185Z + install_torchvision 2025-12-04T09:26:10.9248422Z + local orig_preload 2025-12-04T09:26:10.9248647Z + local commit 2025-12-04T09:26:10.9248865Z ++ get_pinned_commit vision 2025-12-04T09:26:10.9249138Z ++ cat .github/ci_commit_pins/vision.txt 2025-12-04T09:26:10.9261424Z + commit=617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:10.9261890Z + orig_preload= 2025-12-04T09:26:10.9262200Z + '[' -n '' ']' 2025-12-04T09:26:10.9262634Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck == *cuda* ]] 2025-12-04T09:26:10.9263021Z + export FORCE_CUDA=1 2025-12-04T09:26:10.9263245Z + FORCE_CUDA=1 2025-12-04T09:26:10.9263457Z + export WITH_CUDA=1 2025-12-04T09:26:10.9263689Z + WITH_CUDA=1 2025-12-04T09:26:10.9264215Z + pip_build_and_install git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e dist/vision 2025-12-04T09:26:10.9265045Z + local 
build_target=git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:10.9265578Z + local wheel_dir=dist/vision 2025-12-04T09:26:10.9265836Z + local found_whl=0 2025-12-04T09:26:10.9266077Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:26:10.9266361Z + [[ -f dist/vision/*.whl ]] 2025-12-04T09:26:10.9266604Z + '[' 0 == 0 ']' 2025-12-04T09:26:10.9267256Z + python3 -m pip wheel --no-build-isolation --no-deps -w dist/vision git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:11.2518433Z Collecting git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:11.2524507Z Cloning https://github.com/pytorch/vision.git (to revision 617079d944b0e72632311c30ae2bbdf1168b901e) to /tmp/pip-req-build-7s8hf3p4 2025-12-04T09:26:11.2695894Z Running command git clone --filter=blob:none --quiet https://github.com/pytorch/vision.git /tmp/pip-req-build-7s8hf3p4 2025-12-04T09:26:12.9428490Z Running command git rev-parse -q --verify 'sha^617079d944b0e72632311c30ae2bbdf1168b901e' 2025-12-04T09:26:12.9455407Z Running command git fetch -q https://github.com/pytorch/vision.git 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:13.0589436Z Resolved https://github.com/pytorch/vision.git to commit 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:26:15.1713378Z Preparing metadata (pyproject.toml) ... done 2025-12-04T09:26:15.1766949Z Building wheels for collected packages: torchvision 2025-12-04T09:27:32.9078329Z Building wheel for torchvision (pyproject.toml) ...
done 2025-12-04T09:27:32.9107323Z Created wheel for torchvision: filename=torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl size=1786535 sha256=2e8fb870e3ba2b5857aba48d512400f4ea81b9a599a9b153b629b4eaf8fb0d2c 2025-12-04T09:27:32.9108497Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/12/b2/29/1f82685c5b5173629e1f36a9b93989ce92ce563e5fb91d27ac 2025-12-04T09:27:32.9145053Z Successfully built torchvision 2025-12-04T09:27:33.0269596Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:27:33.0270322Z + pip_install_whl dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:27:33.0271009Z + args=('dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl') 2025-12-04T09:27:33.0271811Z + local args 2025-12-04T09:27:33.0272192Z + [[ dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl == *\ * ]] 2025-12-04T09:27:33.0272654Z + for path in "${args[@]}" 2025-12-04T09:27:33.0273102Z + echo 'Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl' 2025-12-04T09:27:33.0273970Z Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:27:33.0274937Z + python3 -mpip install --no-index --no-deps dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:27:33.3642837Z Processing ./dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:27:33.3739583Z Installing collected packages: torchvision 2025-12-04T09:27:33.8469756Z Successfully installed torchvision-0.25.0a0+617079d 2025-12-04T09:27:33.8847442Z + '[' -n '' ']' 2025-12-04T09:27:33.8847699Z + test_python_shard 5 2025-12-04T09:27:33.8847948Z + [[ -z 8 ]] 2025-12-04T09:27:33.8848676Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard 5 8 --verbose --upload-artifacts-while-running
2025-12-04T09:27:36.9920655Z Excluding doctests Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9921701Z Excluding test_meta Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9922551Z Excluding test_hub Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9923256Z Excluding test_fx Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9923945Z Excluding test_decomp Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9924768Z Excluding test_cpp_extensions_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9925586Z Excluding test_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9926368Z Excluding test_matmul_cuda Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9927164Z Excluding test_ops Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9928230Z Excluding test_ops_jit Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9929076Z Excluding dynamo/test_recompile_ux Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9930038Z Excluding inductor/test_compiled_optimizers Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9931001Z Excluding inductor/test_cutlass_backend Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9931924Z Excluding inductor/test_max_autotune Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:36.9932842Z Excluding inductor/test_select_algorithm Running in slow gradcheck mode, skipping tests that don't use gradcheck. 
2025-12-04T09:27:36.9933709Z Excluding inductor/test_smoke Running in slow gradcheck mode, skipping tests that don't use gradcheck. 2025-12-04T09:27:38.9818360Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:27:39.0354804Z Ignoring disabled issues: [''] 2025-12-04T09:27:39.0459194Z Found test times from artifacts 2025-12-04T09:27:39.0861410Z Found test times from artifacts 2025-12-04T09:27:39.0874082Z Running all tests 2025-12-04T09:27:39.1512907Z Running parallel tests on 1 processes 2025-12-04T09:27:39.1517643Z Name: tests to run (est. time: 143.33min) 2025-12-04T09:27:39.1518092Z Serial tests (58): 2025-12-04T09:27:39.1518358Z inductor/test_aot_inductor 5/5 2025-12-04T09:27:39.1518748Z inductor/test_torchinductor_codegen_dynamic_shapes 4/4 2025-12-04T09:27:39.1519181Z inductor/test_torchinductor_opinfo 7/14 2025-12-04T09:27:39.1519521Z inductor/test_pattern_matcher 1/1 2025-12-04T09:27:39.1520246Z inductor/test_cuda_repro 1/1 2025-12-04T09:27:39.1520585Z dynamo/test_activation_checkpointing 1/1 2025-12-04T09:27:39.1520956Z dynamo/test_logging 1/1 2025-12-04T09:27:39.1521207Z dynamo/test_repros 1/1 2025-12-04T09:27:39.1521526Z inductor/test_flex_attention 2/6 2025-12-04T09:27:39.1521858Z inductor/test_flex_decoding 2/3 2025-12-04T09:27:39.1522154Z dynamo/test_fx_graph_runnable 1/1 2025-12-04T09:27:39.1522532Z inductor/test_online_softmax 1/1 2025-12-04T09:27:39.1522825Z inductor/test_memory 1/1 2025-12-04T09:27:39.1523086Z dynamo/test_streams 1/1 2025-12-04T09:27:39.1523435Z inductor/test_unbacked_symints 1/1 2025-12-04T09:27:39.1523733Z dynamo/test_aot_compile 1/1 2025-12-04T09:27:39.1524019Z test_privateuseone_python_backend 1/1 2025-12-04T09:27:39.1524463Z test_varlen_attention 1/1 2025-12-04T09:27:39.1524748Z test_autograd 1/1 2025-12-04T09:27:39.1524991Z test_ops_fwd_gradients 7/12 2025-12-04T09:27:39.1525278Z test_ops_gradients 3/10 
2025-12-04T09:27:39.1525593Z test_nestedtensor 1/4 2025-12-04T09:27:39.1525841Z test_sparse_csr 2/2 2025-12-04T09:27:39.1526069Z test_overrides 1/1 2025-12-04T09:27:39.1526403Z test_torchfuzz_repros 1/1 2025-12-04T09:27:39.1526692Z inductor/test_group_batch_fusion 1/1 2025-12-04T09:27:39.1527006Z dynamo/test_dynamic_shapes 1/1 2025-12-04T09:27:39.1527441Z inductor/test_custom_lowering 1/1 2025-12-04T09:27:39.1527862Z inductor/test_perf 1/1 2025-12-04T09:27:39.1528198Z inductor/test_mkldnn_pattern_matcher 1/2 2025-12-04T09:27:39.1528565Z inductor/test_cpu_cpp_wrapper 1/1 2025-12-04T09:27:39.1528909Z dynamo/test_deque_reconstruct 1/1 2025-12-04T09:27:39.1529193Z inductor/test_utils 1/1 2025-12-04T09:27:39.1529512Z inductor/test_indexing 1/1 2025-12-04T09:27:39.1529825Z inductor/test_inductor_annotations 1/1 2025-12-04T09:27:39.1530134Z inductor/test_compile_worker 1/1 2025-12-04T09:27:39.1530506Z export/test_serialize 1/1 2025-12-04T09:27:39.1530788Z export/test_export_strict 1/1 2025-12-04T09:27:39.1531068Z dynamo/test_buffers_override 1/1 2025-12-04T09:27:39.1531459Z inductor/test_split_cat_fx_passes 1/1 2025-12-04T09:27:39.1531759Z inductor/test_cache 1/1 2025-12-04T09:27:39.1532292Z inductor/test_aot_inductor_utils 1/1 2025-12-04T09:27:39.1532601Z inductor/test_control_flow 3/4 2025-12-04T09:27:39.1532904Z test_cpp_api_parity 1/1 2025-12-04T09:27:39.1533241Z test_foreach 2/2 2025-12-04T09:27:39.1533486Z nn/test_packed_sequence 1/1 2025-12-04T09:27:39.1533748Z test_numa_binding 1/1 2025-12-04T09:27:39.1534086Z test_pruning_op 1/1 2025-12-04T09:27:39.1534333Z test_jit_fuser_te 1/1 2025-12-04T09:27:39.1534590Z optim/test_lrscheduler 1/1 2025-12-04T09:27:39.1534973Z torch_np/numpy_tests/core/test_indexing 1/1 2025-12-04T09:27:39.1535292Z test_futures 1/1 2025-12-04T09:27:39.1535546Z test_tensor_creation_ops 1/1 2025-12-04T09:27:39.1535897Z test_scaled_matmul_cuda 1/1 2025-12-04T09:27:39.1536199Z torch_np/numpy_tests/core/test_shape_base 1/1 
2025-12-04T09:27:39.1536558Z test_vulkan 1/1 2025-12-04T09:27:39.1536823Z lazy/test_generator 1/1 2025-12-04T09:27:39.1537081Z nn/test_convolution 1/2 2025-12-04T09:27:39.1537372Z Parallel tests (0): 2025-12-04T09:27:39.1537681Z Name: excluded (est. time: 0.0min) 2025-12-04T09:27:39.1537953Z Serial tests (0): 2025-12-04T09:27:39.1538184Z Parallel tests (0): 2025-12-04T09:27:39.1538647Z Running inductor/test_aot_inductor 5/5 ... [2025-12-04 09:27:39.152187][937.932159866] 2025-12-04T09:27:39.1539144Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:27:39.1540466Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor.py', '--shard-id=5', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:27:39.152598] 2025-12-04T09:31:56.0136259Z 2025-12-04T09:31:56.0137300Z inductor/test_aot_inductor 5/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_5.5_cecd223de7bcc4ce_.log 2025-12-04T09:31:56.0156918Z Running 50 items in this shard: test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, 
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, 
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, 
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_upper_bound_i64_cuda 2025-12-04T09:31:56.0175466Z 2025-12-04T09:31:56.0175767Z Finished inductor/test_aot_inductor 5/5 ... [2025-12-04 09:31:56.012702][1194.792675163], took 4.28min 2025-12-04T09:31:56.0176826Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-33c1f9cf025a3215.xml 2025-12-04T09:31:56.4883668Z Uploading artifacts took 0.16 seconds 2025-12-04T09:31:56.4886950Z Running inductor/test_torchinductor_codegen_dynamic_shapes 4/4 ... [2025-12-04 09:31:56.488362][1195.268333751] 2025-12-04T09:31:56.4887614Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:31:56.4891499Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_codegen_dynamic_shapes.py', '--shard-id=4', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:31:56.488757] 2025-12-04T09:32:06.3746461Z 2025-12-04T09:32:06.3747882Z inductor/test_torchinductor_codegen_dynamic_shapes 4/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_4.4_aaf0e808e3c0e61a_.log 2025-12-04T09:32:06.3748885Z Running 0 items in this shard: 2025-12-04T09:32:06.3750290Z 2025-12-04T09:32:06.3750823Z Finished inductor/test_torchinductor_codegen_dynamic_shapes 4/4 ... 
[2025-12-04 09:32:06.374238][1205.154210429], took 0.16min 2025-12-04T09:32:06.3753467Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_codegen_dynamic_shapes/inductor.test_torchinductor_codegen_dynamic_shapes-714bf3905702b92c.xml 2025-12-04T09:32:06.4538057Z Running inductor/test_torchinductor_opinfo 7/14 ... [2025-12-04 09:32:06.453274][1205.233247414] 2025-12-04T09:32:06.4538794Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:32:06.4541715Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '--shard-id=7', '--num-shards=14', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:32:06.453599] 2025-12-04T09:32:20.5928125Z 2025-12-04T09:32:20.5929101Z inductor/test_torchinductor_opinfo 7/14 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_7.14_a33a8c2b64197fdb_.log 2025-12-04T09:32:20.5954598Z Running 50 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64 2025-12-04T09:32:20.5978612Z 2025-12-04T09:32:20.5978951Z Finished inductor/test_torchinductor_opinfo 7/14 ... [2025-12-04 09:32:20.592559][1219.372531804], took 0.24min 2025-12-04T09:32:20.5980084Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-0945b0cce338d2d9.xml 2025-12-04T09:32:20.6666875Z Running inductor/test_pattern_matcher 1/1 ... [2025-12-04 09:32:20.666367][1219.446340606] 2025-12-04T09:32:20.6667461Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:32:20.6670456Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_pattern_matcher.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:32:20.666673] 2025-12-04T09:41:42.7216968Z 2025-12-04T09:41:42.7217838Z PRINTING LOG FILE of inductor/test_pattern_matcher 1/1 (test/test-reports/inductor.test_pattern_matcher_1.1_71c2676cd32e51e5_.log) 2025-12-04T09:41:42.7224745Z Test results will be stored in test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-c842e470cbb98a3c.xml 2025-12-04T09:41:42.7225767Z ============================= test session starts ============================== 2025-12-04T09:41:42.7226425Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:41:42.7227041Z cachedir: .pytest_cache 2025-12-04T09:41:42.7227834Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:41:42.7228624Z rootdir: /var/lib/jenkins/workspace 2025-12-04T09:41:42.7228948Z configfile: pytest.ini 2025-12-04T09:41:42.7229557Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:41:42.7230293Z collecting ... 
collected 52 items 2025-12-04T09:41:42.7230709Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:41:42.7328587Z Running 250 items in this shard: test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes 2025-12-04T09:41:42.7460137Z 2025-12-04T09:41:42.7461142Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 0%] 2025-12-04T09:41:42.7463167Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 0%] 2025-12-04T09:41:42.7464741Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [6.7734s] [ 1%] 2025-12-04T09:41:42.7465792Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.1899s] [ 1%] 2025-12-04T09:41:42.7467409Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7469536Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7471433Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7473126Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7475340Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7477307Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7479122Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0009s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7481070Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7483046Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7485034Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7487002Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7489035Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 
2025-12-04T09:41:42.7490946Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7493085Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7495070Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7497089Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7499003Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7501188Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7503175Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7505065Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7507105Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are 
run) [ 2%] 2025-12-04T09:41:42.7509006Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7511006Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7512962Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7514891Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7516868Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7518815Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7520755Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7522646Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7524572Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled 
tests are run) [ 2%] 2025-12-04T09:41:42.7526492Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7528655Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7530575Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7532518Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7534532Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7536523Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7538356Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7540153Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7541880Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only 
disabled tests are run) [ 2%] 2025-12-04T09:41:42.7543638Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7545691Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7547606Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7549424Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7551310Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7553259Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7555116Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7556980Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7558784Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, 
so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7560642Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7562884Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7564872Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7566850Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7568827Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7570733Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7572779Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7574799Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7576802Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED 
[0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7578821Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7580600Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7582349Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7584246Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7586154Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7588125Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7590017Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7591900Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 
2025-12-04T09:41:42.7593810Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7595758Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7597872Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7599891Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7601843Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7603731Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7605740Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7607360Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7609356Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is 
enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7611333Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7613259Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7615329Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7617321Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7619313Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7621080Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7622991Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7624907Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7626672Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7628710Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7630805Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7632634Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7634662Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7636511Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7638530Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7640312Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7642118Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but 
--rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7643976Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7645970Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7648213Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7650278Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7652108Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7653993Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7655866Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7657857Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7659184Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.7856s] [ 2%] 2025-12-04T09:41:42.7660143Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.8412s] [ 2%] 2025-12-04T09:41:42.7661073Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6229s] [ 2%] 2025-12-04T09:41:42.7662094Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6761s] [ 2%] 2025-12-04T09:41:42.7663265Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4473s] [ 2%] 2025-12-04T09:41:42.7664235Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4579s] [ 2%] 2025-12-04T09:41:42.7665246Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4552s] [ 2%] 2025-12-04T09:41:42.7666329Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4629s] [ 2%] 2025-12-04T09:41:42.7667304Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6464s] [ 2%] 2025-12-04T09:41:42.7668420Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4572s] [ 2%] 2025-12-04T09:41:42.7669504Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4886s] [ 2%] 2025-12-04T09:41:42.7670614Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4572s] [ 2%] 2025-12-04T09:41:42.7671457Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4412s] [ 2%] 2025-12-04T09:41:42.7672495Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5377s] [ 2%] 2025-12-04T09:41:42.7673402Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6725s] [ 2%] 2025-12-04T09:41:42.7674522Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.7629s] [ 2%] 2025-12-04T09:41:42.7675431Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.7061s] [ 2%] 2025-12-04T09:41:42.7676457Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.8028s] [ 2%] 2025-12-04T09:41:42.7677532Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.8952s] [ 2%] 2025-12-04T09:41:42.7678616Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4332s] [ 2%] 2025-12-04T09:41:42.7679754Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5714s] [ 2%] 2025-12-04T09:41:42.7680747Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5732s] [ 2%] 2025-12-04T09:41:42.7681972Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6897s] [ 2%] 2025-12-04T09:41:42.7683068Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4675s] [ 2%] 2025-12-04T09:41:42.7684178Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4760s] [ 2%] 2025-12-04T09:41:42.7685222Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4813s] [ 2%] 2025-12-04T09:41:42.7686237Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4645s] [ 2%] 2025-12-04T09:41:42.7687235Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4882s] [ 2%] 2025-12-04T09:41:42.7688249Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.6872s] [ 2%] 2025-12-04T09:41:42.7689195Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4941s] [ 2%] 2025-12-04T09:41:42.7690214Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5041s] [ 2%] 2025-12-04T09:41:42.7691266Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4895s] [ 2%] 2025-12-04T09:41:42.7692271Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4883s] [ 2%] 2025-12-04T09:41:42.7693300Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4759s] [ 2%] 2025-12-04T09:41:42.7694329Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4842s] [ 2%] 2025-12-04T09:41:42.7695407Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.7063s] [ 2%] 2025-12-04T09:41:42.7696463Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4503s] [ 2%] 2025-12-04T09:41:42.7697341Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4684s] [ 2%] 2025-12-04T09:41:42.7698352Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4937s] [ 2%] 2025-12-04T09:41:42.7699322Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5002s] [ 2%] 2025-12-04T09:41:42.7700240Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4901s] [ 2%] 2025-12-04T09:41:42.7701414Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4769s] [ 2%] 2025-12-04T09:41:42.7702451Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.7332s] [ 2%] 2025-12-04T09:41:42.7703560Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5040s] [ 2%] 2025-12-04T09:41:42.7704625Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4834s] [ 2%] 2025-12-04T09:41:42.7705640Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4575s] [ 2%] 2025-12-04T09:41:42.7706776Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.4809s] [ 2%] 2025-12-04T09:41:42.7707831Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5050s] [ 2%] 2025-12-04T09:41:42.7708757Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.5102s] [ 2%] 2025-12-04T09:41:42.7709882Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [9.0772s] [ 2%] 2025-12-04T09:41:42.7711042Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.4561s] [ 2%] 2025-12-04T09:41:42.7711943Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.8446s] [ 2%] 2025-12-04T09:41:42.7712738Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.7393s] [ 2%] 2025-12-04T09:41:42.7760164Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.0097s] [ 2%] 2025-12-04T09:41:42.7761525Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.1224s] [ 2%] 2025-12-04T09:41:42.7762681Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [8.1348s] [ 2%] 2025-12-04T09:41:42.7763804Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.0937s] [ 2%] 2025-12-04T09:41:42.7764921Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [5.6103s] [ 2%] 2025-12-04T09:41:42.7766022Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.7698s] [ 2%] 2025-12-04T09:41:42.7767194Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED 
[8.6726s] [ 2%] 2025-12-04T09:41:42.7768319Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.4116s] [ 2%] 2025-12-04T09:41:42.7769440Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.8501s] [ 2%] 2025-12-04T09:41:42.7770561Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.1830s] [ 2%] 2025-12-04T09:41:42.7771692Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [5.9019s] [ 2%] 2025-12-04T09:41:42.7772847Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.5304s] [ 2%] 2025-12-04T09:41:42.7773947Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.4033s] [ 2%] 2025-12-04T09:41:42.7775259Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.8995s] [ 2%] 2025-12-04T09:41:42.7776406Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.7990s] [ 2%] 2025-12-04T09:41:42.7777523Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.3970s] [ 2%] 2025-12-04T09:41:42.7778625Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.0633s] [ 2%] 2025-12-04T09:41:42.7779719Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.2010s] [ 2%] 2025-12-04T09:41:42.7780875Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.4114s] [ 2%] 2025-12-04T09:41:42.7781984Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.2614s] [ 2%] 2025-12-04T09:41:42.7783049Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.0475s] [ 2%] 2025-12-04T09:41:42.7784167Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.0992s] [ 2%] 2025-12-04T09:41:42.7785305Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.1381s] [ 2%] 2025-12-04T09:41:42.7786351Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.0758s] [ 2%] 2025-12-04T09:41:42.7787292Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.5255s] [ 2%] 2025-12-04T09:41:42.7788387Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.4105s] [ 2%] 2025-12-04T09:41:42.7789321Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [5.9318s] [ 2%] 2025-12-04T09:41:42.7790272Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.1449s] [ 2%] 2025-12-04T09:41:42.7791227Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.8411s] [ 2%] 2025-12-04T09:41:42.7792141Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.8764s] [ 2%] 2025-12-04T09:41:42.7792956Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.7429s] [ 2%] 2025-12-04T09:41:42.7793755Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [9.1164s] [ 2%] 2025-12-04T09:41:42.7794543Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [9.6172s] [ 2%] 2025-12-04T09:41:42.7795352Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.0577s] [ 2%] 2025-12-04T09:41:42.7796151Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.0710s] [ 2%] 2025-12-04T09:41:42.7796949Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.1978s] [ 2%] 2025-12-04T09:41:42.7797740Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.2279s] [ 2%] 2025-12-04T09:41:42.7798536Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.0860s] [ 2%] 2025-12-04T09:41:42.7799570Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.5149s] [ 2%] 2025-12-04T09:41:42.7800972Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [9.3754s] [ 2%] 2025-12-04T09:41:42.7801865Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.2609s] [ 2%] 2025-12-04T09:41:42.7802667Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes PASSED [8.5809s] [ 2%] 2025-12-04T09:41:42.7803641Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.3124s] [ 2%] 2025-12-04T09:41:42.7804453Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [6.1455s] [ 2%] 2025-12-04T09:41:42.7805338Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes FAILED [7.9201s] [ 2%] 2025-12-04T09:41:42.7806561Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7808507Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7810681Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7812461Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7814448Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7816513Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7818699Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7821102Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7823325Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7825261Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7827127Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7829187Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7831224Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7833161Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7835313Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7837504Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7839333Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7841340Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7843132Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7844728Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7846341Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7847899Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7849667Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7851689Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7853511Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7855289Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7857392Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7859305Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7861398Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7863518Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7865447Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7867726Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7869801Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7871801Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7873479Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7875056Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7877095Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7879034Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7881274Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7883467Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7885719Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7887924Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7889970Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test 
is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7891737Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7893375Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7894972Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7896801Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7899045Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0008s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7901460Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:41:42.7902657Z 2025-12-04T09:41:42.7902846Z =================================== FAILURES =================================== 2025-12-04T09:41:42.7903499Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.7904108Z Traceback (most recent call last): 2025-12-04T09:41:42.7904956Z File 
"/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.7905754Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.7906627Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.7911577Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.7912223Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.7912714Z Searched string: 2025-12-04T09:41:42.7913081Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.7913318Z 2025-12-04T09:41:42.7913505Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.7913844Z 2025-12-04T09:41:42.7914031Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.7914535Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.7914791Z 2025-12-04T09:41:42.7914893Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.7915258Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.7915615Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.7916001Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.7916248Z 2025-12-04T09:41:42.7916391Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.7916787Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.7917193Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.7917604Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.7917849Z 2025-12-04T09:41:42.7917855Z 2025-12-04T09:41:42.7918267Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.7918663Z 2025-12-04T09:41:42.7918669Z 2025-12-04T09:41:42.7918862Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.7919345Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.7919918Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.7920334Z idx_m = rm[:, None] 2025-12-04T09:41:42.7920655Z idx_n = rn[None, :] 2025-12-04T09:41:42.7921053Z mask = 
(idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.7921311Z 2025-12-04T09:41:42.7921456Z # inductor generates a suffix 2025-12-04T09:41:42.7921835Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.7922382Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.7922953Z ''', device_str='cuda') 2025-12-04T09:41:42.7923154Z 2025-12-04T09:41:42.7923159Z 2025-12-04T09:41:42.7923299Z async_compile.wait(globals()) 2025-12-04T09:41:42.7923668Z del async_compile 2025-12-04T09:41:42.7923879Z 2025-12-04T09:41:42.7924018Z class Runner: 2025-12-04T09:41:42.7924347Z def __init__(self, partitions): 2025-12-04T09:41:42.7924771Z self.partitions = partitions 2025-12-04T09:41:42.7925047Z 2025-12-04T09:41:42.7925179Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.7925480Z new_callables = [] 2025-12-04T09:41:42.7925898Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.7926348Z new_callables.append(fn(c)) 2025-12-04T09:41:42.7926771Z self.partitions = new_callables 2025-12-04T09:41:42.7927101Z 2025-12-04T09:41:42.7927235Z def call(self, args): 2025-12-04T09:41:42.7927591Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.7928113Z args.clear() 2025-12-04T09:41:42.7928490Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.7928987Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.7929465Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.7929890Z torch.cuda.set_device(0) 2025-12-04T09:41:42.7930380Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.7930893Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.7931327Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.7931706Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.7932087Z del arg0_1 2025-12-04T09:41:42.7932537Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.7933312Z # Topologically Sorted 
Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.7933967Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.7934507Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.7934973Z del arg1_1 2025-12-04T09:41:42.7935293Z del buf0 2025-12-04T09:41:42.7935607Z return (buf1, ) 2025-12-04T09:41:42.7935808Z 2025-12-04T09:41:42.7935960Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.7936235Z call = runner.call 2025-12-04T09:41:42.7936618Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.7937048Z 2025-12-04T09:41:42.7937054Z 2025-12-04T09:41:42.7937253Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.7937784Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.7938302Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.7938913Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.7939653Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.7940216Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.7940729Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.7941105Z 2025-12-04T09:41:42.7941269Z 2025-12-04T09:41:42.7941404Z if __name__ == "__main__": 2025-12-04T09:41:42.7941828Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.7942308Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.7942854Z From CHECK: .to( 2025-12-04T09:41:42.7943038Z 2025-12-04T09:41:42.7943043Z 2025-12-04T09:41:42.7943293Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.7944310Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.7945238Z 2025-12-04T09:41:42.7945548Z This message 
can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.7946266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.7946821Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.7947315Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.7947991Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.7950391Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.7952579Z graph_break [] 2025-12-04T09:41:42.7952819Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.7953512Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.7954049Z Autotune Choices Stats: 2025-12-04T09:41:42.7955374Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.7956803Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.7957191Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.7957561Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.7958530Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.7959916Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.7961328Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.7962347Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.7963357Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.7964671Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.7966154Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.7967697Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.7968767Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.7970076Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.7971218Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.7971957Z Autotune Choices Stats: 
2025-12-04T09:41:42.7973323Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8022782Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8023215Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8023609Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8024598Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8026084Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8027548Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8029248Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8030681Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8032123Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8033557Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8034981Z 
triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8036467Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8037946Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8039201Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.8040258Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.8040902Z Traceback (most recent call last): 2025-12-04T09:41:42.8041729Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8042730Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.8043628Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.8044500Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.8045250Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.8045610Z Searched string: 2025-12-04T09:41:42.8045872Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.8046104Z 2025-12-04T09:41:42.8046225Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.8046453Z 2025-12-04T09:41:42.8046579Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8046989Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8047220Z 2025-12-04T09:41:42.8047320Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.8047585Z idx_n = 
a_k_idx_vals 2025-12-04T09:41:42.8047850Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8048113Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.8048290Z 2025-12-04T09:41:42.8048377Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.8048624Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.8048900Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8049173Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.8049357Z 2025-12-04T09:41:42.8049361Z 2025-12-04T09:41:42.8049514Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.8049796Z 2025-12-04T09:41:42.8049801Z 2025-12-04T09:41:42.8049921Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.8050257Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.8050577Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.8050881Z idx_m = rm[:, None] 2025-12-04T09:41:42.8051108Z idx_n = rn[None, :] 2025-12-04T09:41:42.8051358Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.8051547Z 2025-12-04T09:41:42.8051646Z # inductor generates a suffix 2025-12-04T09:41:42.8052002Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8052390Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.8052789Z ''', device_str='cuda') 2025-12-04T09:41:42.8052940Z 2025-12-04T09:41:42.8052944Z 2025-12-04T09:41:42.8053047Z async_compile.wait(globals()) 2025-12-04T09:41:42.8053312Z del async_compile 2025-12-04T09:41:42.8053443Z 2025-12-04T09:41:42.8053534Z class Runner: 2025-12-04T09:41:42.8053760Z def __init__(self, partitions): 2025-12-04T09:41:42.8054070Z self.partitions = partitions 2025-12-04T09:41:42.8054270Z 2025-12-04T09:41:42.8054392Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.8054689Z new_callables = [] 2025-12-04T09:41:42.8054973Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.8055309Z new_callables.append(fn(c)) 2025-12-04T09:41:42.8055612Z self.partitions = new_callables 2025-12-04T09:41:42.8055811Z 
2025-12-04T09:41:42.8055905Z def call(self, args): 2025-12-04T09:41:42.8056158Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.8056399Z args.clear() 2025-12-04T09:41:42.8056657Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8057023Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8057378Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.8057668Z torch.cuda.set_device(0) 2025-12-04T09:41:42.8058025Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8058542Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.8058973Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8059363Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.8059745Z del arg0_1 2025-12-04T09:41:42.8060064Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8060591Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.8061050Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8061450Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.8061957Z del arg1_1 2025-12-04T09:41:42.8062196Z del buf0 2025-12-04T09:41:42.8062429Z return (buf1, ) 2025-12-04T09:41:42.8062569Z 2025-12-04T09:41:42.8062671Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.8062950Z call = runner.call 2025-12-04T09:41:42.8063257Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.8063616Z 2025-12-04T09:41:42.8063622Z 2025-12-04T09:41:42.8063815Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.8064287Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.8064779Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.8065296Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.8066003Z arg1_1 = rand_strided((256, 256), (256, 1), 
device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.8066540Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.8067038Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.8067388Z 2025-12-04T09:41:42.8067392Z 2025-12-04T09:41:42.8067501Z if __name__ == "__main__": 2025-12-04T09:41:42.8067940Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.8068574Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.8069093Z From CHECK: .to( 2025-12-04T09:41:42.8069286Z 2025-12-04T09:41:42.8069291Z 2025-12-04T09:41:42.8069544Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.8070397Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8071269Z 2025-12-04T09:41:42.8071529Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.8072239Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8072745Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8073193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8073891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8076232Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8078508Z graph_break [] 2025-12-04T09:41:42.8078834Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:42.8079428Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8080070Z Autotune Choices Stats: 2025-12-04T09:41:42.8081495Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8082922Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8083278Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8083680Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8084592Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8085913Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8087557Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8088927Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8090301Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8091910Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8093705Z triton_mm_18 0.0287 ms 92.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8095196Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8096719Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8098442Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8099849Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.8100965Z Autotune Choices Stats: 2025-12-04T09:41:42.8102352Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8103653Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8104122Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8104761Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8105925Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8107524Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8109182Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8110852Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8112539Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8114074Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8115668Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8117207Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8118841Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8120629Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8122086Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.8122895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8123457Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8123964Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8124953Z aot_autograd [('total', 
1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8126941Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8128678Z graph_break [] 2025-12-04T09:41:42.8129231Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8129896Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8130443Z Autotune Choices Stats: 2025-12-04T09:41:42.8132067Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8133577Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8134066Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8134495Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8135285Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8136599Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8137779Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8138968Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8140181Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8141435Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8142638Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8143812Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8145041Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8146166Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8147328Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.8148130Z Autotune Choices Stats: 2025-12-04T09:41:42.8149308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8150430Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8150838Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8151213Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:42.8152049Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8153193Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8154410Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8155631Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8156835Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8158223Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8159738Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8161007Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8162206Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8163649Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8164699Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.8165482Z Autotune Choices Stats: 2025-12-04T09:41:42.8166573Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.8167727Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8168238Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8168705Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8169791Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8171475Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8173059Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8174925Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8176475Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8178208Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8179871Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8181494Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8183104Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8184785Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8186235Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.8187545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8188209Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8188941Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8189728Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8191707Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8193541Z graph_break [] 2025-12-04T09:41:42.8193939Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8194682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8195397Z 
Autotune Choices Stats: 2025-12-04T09:41:42.8196905Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.8198496Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8198993Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8199719Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8201129Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8202838Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8204442Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8206056Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8208124Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8209772Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8211355Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:42.8213050Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8214670Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8216294Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8217958Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.8218787Z Autotune Choices Stats: 2025-12-04T09:41:42.8220432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.8221966Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8222589Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8223094Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8224390Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8225975Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8303039Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8304527Z 
triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8306015Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8307572Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8309096Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8310548Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8312066Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8313642Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8314801Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.8315334Z Autotune Choices Stats: 2025-12-04T09:41:42.8316711Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8317766Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:42.8318034Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8318311Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8319011Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8320212Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8321308Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8322378Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8323462Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8324509Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8325719Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8326785Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8327845Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:42.8328891Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8329984Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.8330752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8331260Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8335837Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8336343Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8338085Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8339645Z graph_break [] 2025-12-04T09:41:42.8339883Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8361110Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8361608Z Autotune Choices Stats: 2025-12-04T09:41:42.8362848Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8363961Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8384980Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8385299Z dtypes: torch.float16, 
torch.float16 2025-12-04T09:41:42.8386040Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8387242Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8389515Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8391650Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8393737Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8395806Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8397562Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8398713Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8399808Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8401091Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8401990Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.8402512Z Autotune Choices Stats: 2025-12-04T09:41:42.8403506Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8404535Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8404799Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8405071Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8405742Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8406801Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8407899Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8408953Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8410195Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8411229Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8412268Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8413317Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8414375Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8415451Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8416363Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.8417020Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.8417502Z Traceback (most recent call last): 2025-12-04T09:41:42.8418109Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8418810Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.8419438Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.8420185Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.8420634Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.8420992Z Searched string: 2025-12-04T09:41:42.8421259Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.8421497Z 2025-12-04T09:41:42.8421617Z for k_idx in range(0, tl.cdiv(K, 
BLOCK_K)): 2025-12-04T09:41:42.8421828Z 2025-12-04T09:41:42.8421965Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8422316Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8422548Z 2025-12-04T09:41:42.8422649Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.8422921Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.8423179Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8423443Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.8423622Z 2025-12-04T09:41:42.8423711Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.8423974Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.8424236Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8424509Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.8424681Z 2025-12-04T09:41:42.8424685Z 2025-12-04T09:41:42.8424853Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.8425114Z 2025-12-04T09:41:42.8425118Z 2025-12-04T09:41:42.8425248Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.8425580Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.8425917Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.8426223Z idx_m = rm[:, None] 2025-12-04T09:41:42.8426451Z idx_n = rn[None, :] 2025-12-04T09:41:42.8426695Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.8426875Z 2025-12-04T09:41:42.8426980Z # inductor generates a suffix 2025-12-04T09:41:42.8427298Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8427684Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.8428090Z ''', device_str='cuda') 2025-12-04T09:41:42.8428236Z 2025-12-04T09:41:42.8428239Z 2025-12-04T09:41:42.8428347Z async_compile.wait(globals()) 2025-12-04T09:41:42.8428600Z del async_compile 2025-12-04T09:41:42.8428902Z 2025-12-04T09:41:42.8428988Z class Runner: 2025-12-04T09:41:42.8429216Z def __init__(self, partitions): 2025-12-04T09:41:42.8429506Z self.partitions = partitions 2025-12-04T09:41:42.8429702Z 2025-12-04T09:41:42.8429813Z def 
recursively_apply_fns(self, fns): 2025-12-04T09:41:42.8430107Z new_callables = [] 2025-12-04T09:41:42.8430385Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.8430705Z new_callables.append(fn(c)) 2025-12-04T09:41:42.8431017Z self.partitions = new_callables 2025-12-04T09:41:42.8431214Z 2025-12-04T09:41:42.8431314Z def call(self, args): 2025-12-04T09:41:42.8431555Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.8431812Z args.clear() 2025-12-04T09:41:42.8432085Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8432427Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8432761Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.8433068Z torch.cuda.set_device(0) 2025-12-04T09:41:42.8433421Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8433916Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.8434346Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8434734Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.8435108Z del arg0_1 2025-12-04T09:41:42.8435419Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8435944Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.8436483Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8436899Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.8437313Z del arg1_1 2025-12-04T09:41:42.8437538Z del buf0 2025-12-04T09:41:42.8437774Z return (buf1, ) 2025-12-04T09:41:42.8437951Z 2025-12-04T09:41:42.8438067Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.8438334Z call = runner.call 2025-12-04T09:41:42.8438624Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.8438887Z 2025-12-04T09:41:42.8438891Z 2025-12-04T09:41:42.8439029Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.8439400Z from 
torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.8439853Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.8440291Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.8440804Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.8441221Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.8441573Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.8441840Z 2025-12-04T09:41:42.8441844Z 2025-12-04T09:41:42.8441939Z if __name__ == "__main__": 2025-12-04T09:41:42.8442301Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.8442763Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.8443104Z From CHECK: .to( 2025-12-04T09:41:42.8443241Z 2025-12-04T09:41:42.8443245Z 2025-12-04T09:41:42.8443422Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.8444258Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8444904Z 2025-12-04T09:41:42.8445130Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.8445628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8446009Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8446411Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8446895Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8448638Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8450193Z graph_break [] 2025-12-04T09:41:42.8450434Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8450816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8451189Z Autotune Choices Stats: 2025-12-04T09:41:42.8452196Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8453229Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8453485Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8453758Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8454445Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8455499Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8456688Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8457724Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8458739Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:42.8459766Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8460796Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8461823Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8462844Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8463869Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8464772Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.8465302Z Autotune Choices Stats: 2025-12-04T09:41:42.8466292Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8467332Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8467715Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8467976Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8468648Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8469692Z 
triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8470713Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8471728Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8472745Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8473758Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8474782Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8475805Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8476922Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8477955Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8478863Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.8479522Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8479910Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8480215Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8480700Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8482004Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8483143Z graph_break [] 2025-12-04T09:41:42.8483376Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8483760Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8484133Z Autotune Choices Stats: 2025-12-04T09:41:42.8485120Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8486131Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8486391Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8486661Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8487340Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8488471Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8489522Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8490576Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8491950Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8493130Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8494286Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8495459Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8496612Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8497738Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8498895Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.8499504Z Autotune Choices Stats: 2025-12-04T09:41:42.8500798Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8501966Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8502384Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8502731Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8503542Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8504693Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8505805Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8506994Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8508168Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8509276Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8510476Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8511764Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8512946Z triton_mm_832 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8514169Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8515170Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.8515857Z Autotune Choices Stats: 2025-12-04T09:41:42.8516984Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.8518091Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8518446Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8518851Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8519686Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8520902Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8521983Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8523253Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8524539Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8525672Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8526816Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8527987Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8529132Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8530267Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8531329Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.8532057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8532480Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8532957Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8533555Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8535063Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8536354Z graph_break [] 2025-12-04T09:41:42.8536703Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8537224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8537709Z Autotune Choices Stats: 2025-12-04T09:41:42.8538817Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.8539984Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8540370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8540689Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8541506Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8542686Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8543883Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8545094Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8546274Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8547411Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8548638Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8549768Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8550901Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8552081Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8553076Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.8553694Z Autotune Choices Stats: 2025-12-04T09:41:42.8554833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.8555972Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8556332Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8556755Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8557519Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8558719Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8559986Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8561138Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8562289Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8563421Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8564564Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8565721Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8566963Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8568147Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8569301Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.8569902Z Autotune Choices Stats: 2025-12-04T09:41:42.8570977Z {"num_choices": 15, "num_triton_choices": 15, 
"best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8572167Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8572516Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8572845Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8573707Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8574850Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8575994Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8577323Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8578412Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8579510Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8580729Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8581936Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8583246Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8584353Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8585351Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.8586107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8586592Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8586988Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8587670Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8589513Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8591176Z graph_break [] 2025-12-04T09:41:42.8591505Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8591990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8592481Z Autotune Choices Stats: 2025-12-04T09:41:42.8593620Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8594817Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8595167Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8595584Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8596384Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8597641Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8598823Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8600028Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8601337Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8602560Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8603657Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8604770Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.8606127Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8607250Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8608307Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.8608899Z Autotune Choices Stats: 2025-12-04T09:41:42.8610217Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8611599Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8611983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8612320Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8613266Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8614613Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8615966Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8617292Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8618549Z 
triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8619696Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8620878Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8622113Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8623244Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8624426Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8625452Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.8626128Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8626660Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8627046Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8627656Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8629136Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), 
('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8630359Z graph_break [] 2025-12-04T09:41:42.8630632Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8631358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8631824Z Autotune Choices Stats: 2025-12-04T09:41:42.8632966Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.8634092Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8634439Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8634855Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8635602Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8636763Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8637964Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8639103Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8640289Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8641446Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8642753Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8656653Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8657783Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8658829Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8659741Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.8660269Z Autotune Choices Stats: 2025-12-04T09:41:42.8661276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8662320Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8662592Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8662860Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8663540Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8664600Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8665661Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8666824Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8667865Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8668909Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8669950Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8671003Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8672063Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8673111Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8674023Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.8674548Z Autotune Choices Stats: 2025-12-04T09:41:42.8675547Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8676682Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8676955Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8677232Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8677974Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8679046Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8680180Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8681225Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8682285Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8683325Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8684362Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8685407Z 
triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8686468Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8687601Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8688511Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.8689122Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8689508Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8689825Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8690303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8691606Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8692741Z graph_break [] 2025-12-04T09:41:42.8692979Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8693358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8693738Z Autotune Choices Stats: 2025-12-04T09:41:42.8694733Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8695752Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8696010Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8696279Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8697101Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8698157Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8699219Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8700616Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8701694Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8702763Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8703829Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8704906Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8705958Z triton_mm_1099 0.0286 ms 96.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8706998Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8707914Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.8708445Z Autotune Choices Stats: 2025-12-04T09:41:42.8709584Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.8710617Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8710884Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8711149Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8711828Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8712900Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8713985Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8715033Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8716092Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8717141Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8718297Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8719362Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8720465Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8721521Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8722428Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.8722959Z Autotune Choices Stats: 2025-12-04T09:41:42.8723947Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8724974Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8725240Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8725517Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8726199Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8727253Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8728305Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8729458Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8730536Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8731613Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8732108Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8732583Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8733068Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8733539Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.8733870Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.8734056Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8734154Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8734297Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8734633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8736019Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8736111Z graph_break [] 2025-12-04T09:41:42.8736217Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8736400Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8736497Z Autotune Choices Stats: 2025-12-04T09:41:42.8737332Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8737443Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8737538Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8737644Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8738129Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.8738607Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8739088Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8739574Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8740175Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8740658Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8741139Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8741631Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8742119Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8742601Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8742934Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 
seconds precompiling for 15 choices 2025-12-04T09:41:42.8743035Z Autotune Choices Stats: 2025-12-04T09:41:42.8743873Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8743970Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8744156Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8744264Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8744746Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8745238Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8745709Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8746189Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8746660Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8747146Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8747613Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8748089Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8748572Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8749054Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8749479Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.8749707Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.8749821Z Traceback (most recent call last): 2025-12-04T09:41:42.8750238Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8750425Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.8750778Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.8750969Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.8751141Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.8751240Z Searched string: 2025-12-04T09:41:42.8751375Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.8751384Z 2025-12-04T09:41:42.8751511Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.8751520Z 2025-12-04T09:41:42.8751654Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8751782Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 
2025-12-04T09:41:42.8751786Z 2025-12-04T09:41:42.8751890Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.8751987Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.8752087Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8752192Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.8752197Z 2025-12-04T09:41:42.8752286Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.8752391Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.8752487Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8752580Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.8752664Z 2025-12-04T09:41:42.8752668Z 2025-12-04T09:41:42.8752848Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.8752853Z 2025-12-04T09:41:42.8752857Z 2025-12-04T09:41:42.8752983Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.8753117Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.8753236Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.8753328Z idx_m = rm[:, None] 2025-12-04T09:41:42.8753420Z idx_n = rn[None, :] 2025-12-04T09:41:42.8753521Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.8753525Z 2025-12-04T09:41:42.8753624Z # inductor generates a suffix 2025-12-04T09:41:42.8753725Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8753943Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.8754042Z ''', device_str='cuda') 2025-12-04T09:41:42.8754047Z 2025-12-04T09:41:42.8754055Z 2025-12-04T09:41:42.8754155Z async_compile.wait(globals()) 2025-12-04T09:41:42.8754239Z del async_compile 2025-12-04T09:41:42.8754244Z 2025-12-04T09:41:42.8754330Z class Runner: 2025-12-04T09:41:42.8754433Z def __init__(self, partitions): 2025-12-04T09:41:42.8754543Z self.partitions = partitions 2025-12-04T09:41:42.8754548Z 2025-12-04T09:41:42.8754665Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.8754761Z new_callables = [] 2025-12-04T09:41:42.8754889Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.8754998Z 
new_callables.append(fn(c)) 2025-12-04T09:41:42.8755102Z self.partitions = new_callables 2025-12-04T09:41:42.8755107Z 2025-12-04T09:41:42.8755203Z def call(self, args): 2025-12-04T09:41:42.8755297Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.8755385Z args.clear() 2025-12-04T09:41:42.8755524Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8755654Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8755768Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.8755873Z torch.cuda.set_device(0) 2025-12-04T09:41:42.8756042Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8756440Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.8756543Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8756735Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.8756830Z del arg0_1 2025-12-04T09:41:42.8756995Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8757250Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.8757357Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8757577Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.8757673Z del arg1_1 2025-12-04T09:41:42.8757755Z del buf0 2025-12-04T09:41:42.8757843Z return (buf1, ) 2025-12-04T09:41:42.8757848Z 2025-12-04T09:41:42.8757958Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.8758051Z call = runner.call 2025-12-04T09:41:42.8758212Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.8758217Z 2025-12-04T09:41:42.8758221Z 2025-12-04T09:41:42.8758368Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.8758503Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.8758660Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.8758863Z arg0_1 = rand_strided((256, 256), 
(256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.8759067Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.8759179Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.8759437Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.8759442Z 2025-12-04T09:41:42.8759446Z 2025-12-04T09:41:42.8759684Z if __name__ == "__main__": 2025-12-04T09:41:42.8759897Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.8760060Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.8760154Z From CHECK: .to( 2025-12-04T09:41:42.8760158Z 2025-12-04T09:41:42.8760162Z 2025-12-04T09:41:42.8760337Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.8760890Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8760903Z 2025-12-04T09:41:42.8761124Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.8761305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8761417Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8761550Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8761802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8763194Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:42.8763283Z graph_break [] 2025-12-04T09:41:42.8763397Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8763575Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8763670Z Autotune Choices Stats: 2025-12-04T09:41:42.8764630Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8764730Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8764829Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8764940Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8765428Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8765898Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8766364Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8766851Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8767310Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8767783Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8768252Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8768710Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8769268Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8769736Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8770081Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.8770178Z Autotune Choices Stats: 2025-12-04T09:41:42.8771007Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8771119Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8771212Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8771326Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8771810Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8772271Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.8772738Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8773199Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8773669Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8774239Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8774708Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8775174Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8775645Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8776123Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8776463Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.8776647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8776744Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8776880Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8777133Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8778077Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8778250Z graph_break [] 2025-12-04T09:41:42.8778358Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8778539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8778639Z Autotune Choices Stats: 2025-12-04T09:41:42.8779465Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8779567Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8779656Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8779763Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8780259Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8780749Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8781226Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8781711Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8782190Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8782684Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8783241Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8783718Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8784197Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8784663Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8785003Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.8785105Z Autotune Choices Stats: 2025-12-04T09:41:42.8785951Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8786055Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8786147Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8786260Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8786735Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8787205Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8787828Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8788298Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8788767Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8789235Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8789707Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8790184Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8790659Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8791139Z 
triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8791474Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.8791576Z Autotune Choices Stats: 2025-12-04T09:41:42.8792412Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.8792511Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8792608Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8792807Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8793295Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8793760Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8794224Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8794702Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8795184Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8795665Z triton_mm_861 0.0276 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8796142Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8796620Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8797173Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8797689Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8798030Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.8798208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8798312Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8798445Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8798694Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8799741Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8799831Z graph_break [] 2025-12-04T09:41:42.8799948Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8800128Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:42.8800221Z Autotune Choices Stats: 2025-12-04T09:41:42.8801261Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.8801357Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8801451Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8801559Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8802043Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8802722Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8803188Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8803666Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8804138Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8804615Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8805096Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8805569Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8806043Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8806506Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8807019Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.8807132Z Autotune Choices Stats: 2025-12-04T09:41:42.8807965Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.8808068Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8808159Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8808272Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8808744Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8809216Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8809691Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8810155Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8810623Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8811086Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8811564Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8812134Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8812608Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8813085Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8813416Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.8813519Z Autotune Choices Stats: 2025-12-04T09:41:42.8814366Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8814464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8814561Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8814675Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8815164Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8815640Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8816122Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8816685Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8817192Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8817668Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8818132Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8818604Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8819075Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8819547Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8819887Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.8820064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8820170Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8820307Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8820557Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8822098Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8822188Z graph_break [] 2025-12-04T09:41:42.8822304Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8822484Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8822581Z Autotune Choices Stats: 2025-12-04T09:41:42.8823430Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8823531Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.8823628Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8823738Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8824224Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8824697Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8825162Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8825632Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8826176Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8826649Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8827131Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8827603Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8828086Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:42.8828557Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8829170Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.8829268Z Autotune Choices Stats: 2025-12-04T09:41:42.8830119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8830224Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8830314Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8830427Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8830914Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8831475Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8831958Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8832423Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8832893Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8833361Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8833834Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8834319Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8834795Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8835274Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8835687Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.8835873Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8835969Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8836102Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8836356Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8837297Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8837389Z graph_break [] 2025-12-04T09:41:42.8837497Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8837678Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8837779Z Autotune Choices Stats: 2025-12-04T09:41:42.8838622Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.8838717Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8838812Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8838920Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8839410Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8839942Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8840504Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8840984Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8841453Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8841929Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8842398Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8842888Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8843361Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8843835Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8844172Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.8844266Z Autotune Choices Stats: 2025-12-04T09:41:42.8845101Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8845310Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8845400Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8845513Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8845986Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8846469Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8846944Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8847417Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8847897Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8848369Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8848844Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8849318Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8849880Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8850358Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8850692Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.8850794Z Autotune Choices Stats: 2025-12-04T09:41:42.8851645Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8851756Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8851847Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8851962Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8852457Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8852935Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8853411Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8853881Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8854442Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8854918Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8855390Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8855869Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8856343Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8856827Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8857165Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.8857343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8857447Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8857583Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8857839Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8858778Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8858869Z graph_break [] 2025-12-04T09:41:42.8858980Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8859241Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8859347Z Autotune Choices Stats: 2025-12-04T09:41:42.8860178Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8860274Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8860372Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8860480Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:42.8860956Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8861452Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8861933Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8862423Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8862904Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8863394Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8863952Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8864442Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8864915Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8865392Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8865729Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.8865828Z Autotune Choices Stats: 2025-12-04T09:41:42.8866681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.8866778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8866869Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8866982Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8867459Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8867941Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8868421Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8868975Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8869458Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8869929Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8870405Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8870888Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8871374Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8871851Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8872186Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.8872290Z Autotune Choices Stats: 2025-12-04T09:41:42.8873118Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8873300Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8873395Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8873510Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8873993Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8874468Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8874952Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8875438Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8875924Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8876414Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8876900Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8877377Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8877856Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8878439Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8878774Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.8878952Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:42.8879055Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8879188Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8879444Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8880884Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8880974Z graph_break [] 2025-12-04T09:41:42.8881087Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8881266Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8881373Z Autotune Choices Stats: 2025-12-04T09:41:42.8882209Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8882386Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8882483Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8882593Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8883073Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8883559Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.8884034Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8884520Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8885005Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8885495Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8885975Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8886472Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8886952Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8887430Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8887903Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.8888000Z Autotune Choices Stats: 2025-12-04T09:41:42.8888868Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8888970Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8889060Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8889173Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8889652Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8890142Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8890617Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8891088Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8891563Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8892115Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8892599Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8893078Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8893561Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8894038Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8894378Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.8894562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8894662Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8894805Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8895054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8896439Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8896538Z graph_break [] 2025-12-04T09:41:42.8896644Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8896827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8896928Z Autotune Choices Stats: 2025-12-04T09:41:42.8897859Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8897964Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8898054Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8898216Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8898770Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8899551Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8900083Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8900841Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8901381Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8901880Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8902629Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8903206Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:42.8903815Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8904318Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8904683Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.8904868Z Autotune Choices Stats: 2025-12-04T09:41:42.8905749Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8905984Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8906105Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8906245Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8906817Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8907325Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8907911Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8908593Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8909157Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8909741Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8910325Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8910939Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8911598Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8912236Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8912679Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.8912968Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.8913154Z Traceback (most recent call last): 2025-12-04T09:41:42.8913655Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8914165Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.8914601Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.8914839Z 
FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.8915088Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.8915205Z Searched string: 2025-12-04T09:41:42.8915451Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.8915458Z 2025-12-04T09:41:42.8915657Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.8915661Z 2025-12-04T09:41:42.8915830Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8916030Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.8916034Z 2025-12-04T09:41:42.8916162Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.8916313Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.8916498Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8916662Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.8916667Z 2025-12-04T09:41:42.8916823Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.8916952Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.8917081Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8917233Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.8917238Z 2025-12-04T09:41:42.8917241Z 2025-12-04T09:41:42.8917568Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.8917574Z 2025-12-04T09:41:42.8917578Z 2025-12-04T09:41:42.8917786Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.8917939Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.8918090Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.8918244Z idx_m = rm[:, None] 2025-12-04T09:41:42.8918346Z idx_n = rn[None, :] 2025-12-04T09:41:42.8918609Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.8918663Z 2025-12-04T09:41:42.8918796Z # inductor generates a suffix 2025-12-04T09:41:42.8918921Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.8919309Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.8919455Z ''', device_str='cuda') 2025-12-04T09:41:42.8919460Z 
2025-12-04T09:41:42.8919464Z 2025-12-04T09:41:42.8919646Z async_compile.wait(globals()) 2025-12-04T09:41:42.8919890Z del async_compile 2025-12-04T09:41:42.8919895Z 2025-12-04T09:41:42.8920008Z class Runner: 2025-12-04T09:41:42.8920174Z def __init__(self, partitions): 2025-12-04T09:41:42.8920308Z self.partitions = partitions 2025-12-04T09:41:42.8920313Z 2025-12-04T09:41:42.8920478Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.8920675Z new_callables = [] 2025-12-04T09:41:42.8920842Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.8920980Z new_callables.append(fn(c)) 2025-12-04T09:41:42.8921160Z self.partitions = new_callables 2025-12-04T09:41:42.8921165Z 2025-12-04T09:41:42.8921309Z def call(self, args): 2025-12-04T09:41:42.8921449Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.8921614Z args.clear() 2025-12-04T09:41:42.8921796Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8921989Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.8922126Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.8922279Z torch.cuda.set_device(0) 2025-12-04T09:41:42.8922545Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8922857Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.8952694Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8952950Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.8953042Z del arg0_1 2025-12-04T09:41:42.8953395Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.8953666Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.8953777Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.8954002Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.8954094Z del arg1_1 2025-12-04T09:41:42.8954177Z del buf0 2025-12-04T09:41:42.8954272Z return (buf1, ) 
2025-12-04T09:41:42.8954277Z 2025-12-04T09:41:42.8954385Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.8954474Z call = runner.call 2025-12-04T09:41:42.8954644Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.8954649Z 2025-12-04T09:41:42.8954653Z 2025-12-04T09:41:42.8954795Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.8954932Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.8955099Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.8955304Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.8955520Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.8955627Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.8955794Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.8955799Z 2025-12-04T09:41:42.8955803Z 2025-12-04T09:41:42.8955903Z if __name__ == "__main__": 2025-12-04T09:41:42.8956107Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.8956270Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.8956364Z From CHECK: .to( 2025-12-04T09:41:42.8956369Z 2025-12-04T09:41:42.8956373Z 2025-12-04T09:41:42.8956547Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.8957115Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.8957124Z 2025-12-04T09:41:42.8957362Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.8957671Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8957774Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8957912Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8958168Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8959649Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8959754Z graph_break [] 2025-12-04T09:41:42.8959863Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8960044Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8960148Z Autotune Choices Stats: 2025-12-04T09:41:42.8960999Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.8961096Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8961194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8961308Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8961802Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8963651Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8964130Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:42.8964600Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8965061Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8965544Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8966014Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8966481Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8966943Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8967410Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8967754Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.8967853Z Autotune Choices Stats: 2025-12-04T09:41:42.8968773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:42.8968873Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8968963Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8969079Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8969562Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8970031Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8970497Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8970961Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8971429Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8971889Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8972355Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8972904Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8973383Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8973849Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8974183Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.8974368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8974468Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8974610Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.8974863Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8975816Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8975909Z graph_break [] 2025-12-04T09:41:42.8976017Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8976200Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8976296Z Autotune Choices Stats: 2025-12-04T09:41:42.8977146Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8977253Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8977344Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8977453Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8978018Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8978496Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8978977Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8979457Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8979947Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8980434Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8980903Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8981386Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8981855Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8982408Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.8982744Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.8982845Z Autotune Choices Stats: 2025-12-04T09:41:42.8983676Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.8983778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8983878Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8983988Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.8984473Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8984954Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8985427Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8985905Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8986368Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.8986844Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8987517Z triton_mm_827 
0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8987990Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8988469Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8988943Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8989285Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.8989384Z Autotune Choices Stats: 2025-12-04T09:41:42.8990246Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.8990344Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8990435Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8990559Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.8991043Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8991518Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.8992066Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.8992539Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8993020Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8993500Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.8993983Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8994461Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.8994937Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.8995402Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.8995735Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.8995918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.8996019Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.8996164Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:42.8996496Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.8997444Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.8997538Z graph_break [] 2025-12-04T09:41:42.8997646Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.8997826Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.8997928Z Autotune Choices Stats: 2025-12-04T09:41:42.8998782Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.8998893Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.8998990Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.8999103Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.8999697Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9000167Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9001127Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9001796Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9002279Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9002749Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9003220Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9003706Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9004177Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9004653Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9004990Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.9005085Z Autotune Choices Stats: 2025-12-04T09:41:42.9005924Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.9006019Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9006124Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:42.9006232Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9006817Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9007303Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9007768Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9008242Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9008709Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9009186Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9009649Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9010124Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9010599Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9011153Z triton_mm_918 0.0338 ms 
81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9011493Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.9011591Z Autotune Choices Stats: 2025-12-04T09:41:42.9012431Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9012536Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9012626Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9012752Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9013234Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9013732Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9014217Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9014685Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9015158Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9015621Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9016178Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9016645Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9017112Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9017590Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9017925Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.9018113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9018214Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9018351Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9018612Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9020000Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9020094Z graph_break [] 
2025-12-04T09:41:42.9020205Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9020463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9020567Z Autotune Choices Stats: 2025-12-04T09:41:42.9021416Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9021526Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9021620Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9021730Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9022224Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9022693Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9023174Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9023641Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9024107Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9024586Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:42.9025058Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9025641Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9026115Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9026586Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9026926Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.9027025Z Autotune Choices Stats: 2025-12-04T09:41:42.9027874Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9027981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9028080Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9028193Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9028666Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9029146Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9029625Z 
triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9030174Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9030643Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9031119Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9031591Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9032073Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9032558Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9033039Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9033383Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.9033561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9033659Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9033801Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9034051Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9035088Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9035178Z graph_break [] 2025-12-04T09:41:42.9035288Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9035471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9035568Z Autotune Choices Stats: 2025-12-04T09:41:42.9036417Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.9036517Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9036611Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9036729Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9037226Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9037755Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9038237Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9038715Z triton_mm_1011 0.0276 ms 96.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9039194Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9039804Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9040286Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9040762Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9041253Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9041728Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9042074Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.9042184Z Autotune Choices Stats: 2025-12-04T09:41:42.9043016Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9043119Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9043211Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9043323Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9043811Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9044296Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9044863Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9045334Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9045808Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9046282Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9046754Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9047239Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9047714Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:42.9048200Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9048534Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.9048708Z Autotune Choices Stats: 2025-12-04T09:41:42.9049563Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9049661Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9049761Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9049877Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9050365Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9050854Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9051328Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9051809Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9052287Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:42.9052761Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9053238Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9053716Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9054276Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9054751Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9055092Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.9055268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9055369Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9055516Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9055774Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9056729Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9056816Z graph_break [] 2025-12-04T09:41:42.9056925Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:42.9057108Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9057205Z Autotune Choices Stats: 2025-12-04T09:41:42.9058042Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9058250Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9058344Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9058461Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9058946Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9059429Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9059914Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9060397Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9060895Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9061378Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9061860Z triton_mm_1101 0.0277 ms 
99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9062343Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9062814Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9063380Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9063721Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.9063824Z Autotune Choices Stats: 2025-12-04T09:41:42.9064656Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.9064752Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9064851Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9064960Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9065448Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9065933Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9066411Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9066896Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9067374Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9067927Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9068403Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9068884Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9069362Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9069836Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9070177Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.9070276Z Autotune Choices Stats: 2025-12-04T09:41:42.9071121Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9071218Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9071310Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9071429Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9071911Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9072396Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9072952Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9073444Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9073924Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9074406Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9074904Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9075383Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9075865Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9076336Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9076676Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.9076932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9077030Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9077174Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9077425Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9078800Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9078892Z graph_break [] 2025-12-04T09:41:42.9079001Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9079186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9079285Z Autotune Choices Stats: 2025-12-04T09:41:42.9080183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:42.9080285Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9080377Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9080490Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9080974Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9081448Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9081936Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9082499Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9082988Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9083470Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9083961Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9084452Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9084938Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9085421Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9085757Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.9085858Z Autotune Choices Stats: 2025-12-04T09:41:42.9086699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9086881Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9086977Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9087085Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9087578Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9088056Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9088529Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9089025Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9089499Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9089975Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9090446Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9090927Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9091407Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9091995Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9092333Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.9092509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9092615Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9092752Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9093003Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9094391Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 
1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9094485Z graph_break [] 2025-12-04T09:41:42.9094597Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9094774Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9094874Z Autotune Choices Stats: 2025-12-04T09:41:42.9095733Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9095933Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9096032Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9096144Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9096637Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9097126Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9097600Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9098080Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9098556Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9099041Z 
triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9099520Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9099997Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9100740Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9101223Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9101702Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.9101804Z Autotune Choices Stats: 2025-12-04T09:41:42.9102661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9102766Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9102858Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9102971Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9103459Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9103945Z triton_mm_1255 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9104428Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9104903Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9105380Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9105850Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9106447Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9106979Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9107454Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9107936Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9108276Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.9108459Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9108557Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9108695Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9108952Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9110334Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9110427Z graph_break [] 2025-12-04T09:41:42.9110539Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9110715Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9110815Z Autotune Choices Stats: 2025-12-04T09:41:42.9111734Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:42.9111839Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9111932Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9112041Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9112529Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9113007Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9113495Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9113975Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9114453Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9114937Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9115419Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9115976Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9116448Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9116925Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9117260Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:42.9117356Z Autotune Choices Stats: 2025-12-04T09:41:42.9118212Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9118308Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9118404Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9118514Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9118996Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9119526Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9120002Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9120567Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9121036Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9121511Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9121981Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9122458Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9122945Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9123420Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9123762Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:42.9123984Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.9124089Z Traceback (most recent call last): 2025-12-04T09:41:42.9124518Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9124785Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.9125142Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.9125334Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.9125498Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.9125590Z Searched string: 2025-12-04T09:41:42.9125725Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.9125731Z 2025-12-04T09:41:42.9125851Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.9125862Z 2025-12-04T09:41:42.9125992Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9126119Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9126124Z 2025-12-04T09:41:42.9126225Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.9126315Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.9126418Z xindex 
= idx_n + 256*idx_m 2025-12-04T09:41:42.9126520Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.9126524Z 2025-12-04T09:41:42.9126614Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.9126707Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.9126809Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9126899Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.9126903Z 2025-12-04T09:41:42.9126907Z 2025-12-04T09:41:42.9127075Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.9127079Z 2025-12-04T09:41:42.9127083Z 2025-12-04T09:41:42.9127204Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.9127324Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.9127448Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.9127536Z idx_m = rm[:, None] 2025-12-04T09:41:42.9127630Z idx_n = rn[None, :] 2025-12-04T09:41:42.9127728Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.9127739Z 2025-12-04T09:41:42.9127837Z # inductor generates a suffix 2025-12-04T09:41:42.9127935Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9128147Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.9128348Z ''', device_str='cuda') 2025-12-04T09:41:42.9128353Z 2025-12-04T09:41:42.9128364Z 2025-12-04T09:41:42.9128464Z async_compile.wait(globals()) 2025-12-04T09:41:42.9128550Z del async_compile 2025-12-04T09:41:42.9128554Z 2025-12-04T09:41:42.9128641Z class Runner: 2025-12-04T09:41:42.9128743Z def __init__(self, partitions): 2025-12-04T09:41:42.9128848Z self.partitions = partitions 2025-12-04T09:41:42.9128852Z 2025-12-04T09:41:42.9128968Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.9129060Z new_callables = [] 2025-12-04T09:41:42.9129178Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.9129291Z new_callables.append(fn(c)) 2025-12-04T09:41:42.9129396Z self.partitions = new_callables 2025-12-04T09:41:42.9129404Z 2025-12-04T09:41:42.9129504Z def call(self, args): 
2025-12-04T09:41:42.9129593Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.9129677Z args.clear() 2025-12-04T09:41:42.9129813Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9129941Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9130049Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.9130157Z torch.cuda.set_device(0) 2025-12-04T09:41:42.9130325Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9130546Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.9130650Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9130842Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.9130932Z del arg0_1 2025-12-04T09:41:42.9131095Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9131431Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.9131538Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9131761Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:42.9131844Z del arg1_1 2025-12-04T09:41:42.9131928Z del buf0 2025-12-04T09:41:42.9132016Z return (buf1, ) 2025-12-04T09:41:42.9132021Z 2025-12-04T09:41:42.9132127Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.9132212Z call = runner.call 2025-12-04T09:41:42.9132372Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.9132376Z 2025-12-04T09:41:42.9132380Z 2025-12-04T09:41:42.9132524Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.9132654Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.9132805Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.9133018Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.9133220Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 
2025-12-04T09:41:42.9133329Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.9133492Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.9133496Z 2025-12-04T09:41:42.9133500Z 2025-12-04T09:41:42.9133590Z if __name__ == "__main__": 2025-12-04T09:41:42.9133799Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.9133957Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.9134041Z From CHECK: .to( 2025-12-04T09:41:42.9134052Z 2025-12-04T09:41:42.9134055Z 2025-12-04T09:41:42.9134229Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.9134778Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9134787Z 2025-12-04T09:41:42.9135009Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.9135327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9135430Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9135562Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9135815Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9137228Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9137330Z graph_break [] 2025-12-04T09:41:42.9137457Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9137630Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:42.9137728Z Autotune Choices Stats: 2025-12-04T09:41:42.9138569Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9138663Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9138751Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9138865Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9139349Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9139894Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9140360Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9140825Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9141286Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9141758Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9142227Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9142691Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9143157Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9143624Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9143963Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.9144062Z Autotune Choices Stats: 2025-12-04T09:41:42.9144988Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9145092Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9145181Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9145290Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9145776Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9146237Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9146710Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9147170Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9147635Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9148091Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9148550Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9149100Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9149573Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9150045Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9150379Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.9150561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9150658Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9150795Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9151058Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:42.9152004Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9152090Z graph_break [] 2025-12-04T09:41:42.9152203Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9152380Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9152479Z Autotune Choices Stats: 2025-12-04T09:41:42.9153310Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9153411Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9153505Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9153610Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9154173Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9154656Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9155124Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9155605Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9156087Z 
triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9156577Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9157042Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9157519Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9157988Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9158528Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9158869Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.9158964Z Autotune Choices Stats: 2025-12-04T09:41:42.9159863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9159958Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9160045Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9160157Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9160628Z triton_mm_825 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9161105Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9161578Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9162044Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9162516Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9162983Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9163557Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9164029Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9164508Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9164976Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:42.9165310Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.9165416Z Autotune Choices Stats: 2025-12-04T09:41:42.9166251Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.9166355Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9166442Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9166555Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9167091Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9167556Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9168114Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9168583Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9169053Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9169530Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 
2025-12-04T09:41:42.9169995Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9170482Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9170949Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9171420Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9171756Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.9171934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9172041Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9172173Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9172429Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9173441Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9173528Z graph_break [] 2025-12-04T09:41:42.9173646Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9173822Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9173914Z Autotune Choices Stats: 2025-12-04T09:41:42.9174752Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.9174855Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9174953Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9175063Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9175542Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9176023Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9176487Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9177046Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9177520Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9178004Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9178472Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9178942Z triton_mm_889 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9179415Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9179879Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9180221Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.9180316Z Autotune Choices Stats: 2025-12-04T09:41:42.9181143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.9181239Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9181329Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9181444Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9181989Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9182460Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9182929Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9183395Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9183867Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9184338Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9184805Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9185273Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9185747Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9186226Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9186633Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.9186736Z Autotune Choices Stats: 2025-12-04T09:41:42.9187570Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9187673Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9187759Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:42.9187872Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9188355Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9188839Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9189324Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9189788Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9190253Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9190719Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9191185Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9191736Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9192200Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9192675Z triton_mm_942 0.0276 
ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9193008Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.9193190Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9193293Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9193425Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9193674Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9195052Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9195137Z graph_break [] 2025-12-04T09:41:42.9195252Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9195530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9195623Z Autotune Choices Stats: 2025-12-04T09:41:42.9196473Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9196567Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9196664Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9196769Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9197252Z triton_mm_979 0.0266 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9197719Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9198187Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9198662Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9199124Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9199660Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9200130Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9200762Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9201376Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9201844Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9202183Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.9202279Z Autotune Choices Stats: 2025-12-04T09:41:42.9203128Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9203233Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9203323Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9203433Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9203906Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9204374Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9204857Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9205431Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9205903Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9206366Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:42.9206843Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9207320Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9207799Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9208281Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9208609Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.9208792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9208888Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9209020Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9209273Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9210215Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9210390Z graph_break [] 2025-12-04T09:41:42.9216153Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9216379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9216483Z Autotune Choices Stats: 2025-12-04T09:41:42.9217494Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.9217597Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9217687Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9217807Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9218387Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9218956Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9219516Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9220071Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9220624Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9221212Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9221683Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9222164Z triton_mm_1017 
0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9222639Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9223125Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9223463Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.9223559Z Autotune Choices Stats: 2025-12-04T09:41:42.9224399Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9224493Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9224584Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9224688Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9225163Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9225648Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9226203Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9226680Z triton_mm_1044 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9227198Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9227666Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9228147Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9228625Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9229106Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9229579Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9229915Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.9230090Z Autotune Choices Stats: 2025-12-04T09:41:42.9230949Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9231050Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9231134Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9231250Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9231736Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9232220Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9232698Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9233167Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9233652Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9234122Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9234590Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9235067Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9235614Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:42.9236096Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9236427Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.9236608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9236703Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9236835Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9237097Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9238043Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9238132Z graph_break [] 2025-12-04T09:41:42.9238237Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9238412Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9238508Z Autotune Choices Stats: 2025-12-04T09:41:42.9239342Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9239626Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9239713Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9239817Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9240307Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9240783Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9241256Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9241741Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9242224Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9242715Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9243186Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9243675Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9244144Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9244627Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:42.9245065Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.9245160Z Autotune Choices Stats: 2025-12-04T09:41:42.9245994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.9246088Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9246179Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9246284Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9246762Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9247272Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9247772Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9248250Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9248722Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9249267Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9249746Z 
triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9250222Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9250701Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9251176Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9251513Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.9251606Z Autotune Choices Stats: 2025-12-04T09:41:42.9252446Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9252546Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9252632Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9252742Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9253223Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9253701Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9254262Z triton_mm_1156 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9254746Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9255228Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9255710Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9256199Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9256680Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9257151Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9257626Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9257953Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.9258136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9258313Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9258447Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T09:41:42.9258704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9260087Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9260179Z graph_break [] 2025-12-04T09:41:42.9260283Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9260456Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9260557Z Autotune Choices Stats: 2025-12-04T09:41:42.9261389Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9261490Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9261576Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9261680Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9262159Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9262632Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9263106Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9263678Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9264160Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9264645Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9265124Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9265617Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9266103Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9266581Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9266910Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.9267002Z Autotune Choices Stats: 2025-12-04T09:41:42.9267901Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9268076Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9268168Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9268276Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9268754Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9269233Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9269701Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9270177Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9270651Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9271118Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9271591Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9272064Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9272542Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9273121Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9273461Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.9273639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9273736Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9273877Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9274123Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9275507Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9275596Z graph_break [] 2025-12-04T09:41:42.9275701Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9275880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9275972Z Autotune Choices Stats: 2025-12-04T09:41:42.9276818Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9276919Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9277084Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9277195Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9277684Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9278159Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9278633Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9279102Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9279643Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9280119Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9280603Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9281070Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9281543Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:42.9282024Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9282436Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.9282539Z Autotune Choices Stats: 2025-12-04T09:41:42.9283379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9283477Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9283563Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9283668Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9284156Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9284638Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9285117Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9285591Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9286057Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:42.9286531Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9287166Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9287647Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9288120Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9288599Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9288931Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.9289112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9289212Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9289346Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9289596Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9290974Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9291059Z graph_break [] 2025-12-04T09:41:42.9291176Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9291355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9291448Z Autotune Choices Stats: 2025-12-04T09:41:42.9292370Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:42.9292466Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9292559Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9292663Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9293139Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9293623Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9294111Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9294597Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9295074Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9295565Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9296042Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9296594Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9297071Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9297541Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9297883Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:42.9297976Z Autotune Choices Stats: 2025-12-04T09:41:42.9298812Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9298920Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9299007Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9299117Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9299597Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9300071Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9300779Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9301260Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9301889Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9302362Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9302835Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9303309Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9303789Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9304269Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9304596Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:42.9304775Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:42.9304869Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9304999Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9305253Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9306748Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9306837Z graph_break [] 2025-12-04T09:41:42.9306943Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9307117Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9307215Z Autotune Choices Stats: 2025-12-04T09:41:42.9308052Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9308156Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9308243Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9308347Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9308832Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9309308Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9309786Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9310266Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9310756Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9311340Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9311825Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9312307Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9312781Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9313267Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9313601Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:42.9313692Z Autotune Choices Stats: 2025-12-04T09:41:42.9314524Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:42.9314619Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9314710Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9314815Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9315363Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9315844Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9316314Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9316794Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9317263Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9317743Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9318218Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9318692Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9319171Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9319702Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9320047Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:42.9320350Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.9320457Z Traceback (most recent call last): 2025-12-04T09:41:42.9320881Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9321064Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.9321416Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.9321603Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.9321768Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.9321863Z Searched string: 2025-12-04T09:41:42.9322002Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.9322008Z 2025-12-04T09:41:42.9322127Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.9322137Z 2025-12-04T09:41:42.9322267Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9322399Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9322404Z 2025-12-04T09:41:42.9322504Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.9322594Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.9322687Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9322785Z a = 
tl.load(A + (xindex)) 2025-12-04T09:41:42.9322790Z 2025-12-04T09:41:42.9322878Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.9322978Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.9323068Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9323157Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.9323161Z 2025-12-04T09:41:42.9323165Z 2025-12-04T09:41:42.9323331Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.9323416Z 2025-12-04T09:41:42.9323420Z 2025-12-04T09:41:42.9323546Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.9323662Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.9323785Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.9323872Z idx_m = rm[:, None] 2025-12-04T09:41:42.9323964Z idx_n = rn[None, :] 2025-12-04T09:41:42.9324058Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.9324063Z 2025-12-04T09:41:42.9324160Z # inductor generates a suffix 2025-12-04T09:41:42.9324257Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9324470Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.9324556Z ''', device_str='cuda') 2025-12-04T09:41:42.9324561Z 2025-12-04T09:41:42.9324572Z 2025-12-04T09:41:42.9324671Z async_compile.wait(globals()) 2025-12-04T09:41:42.9324754Z del async_compile 2025-12-04T09:41:42.9324763Z 2025-12-04T09:41:42.9324850Z class Runner: 2025-12-04T09:41:42.9324952Z def __init__(self, partitions): 2025-12-04T09:41:42.9325055Z self.partitions = partitions 2025-12-04T09:41:42.9325060Z 2025-12-04T09:41:42.9325177Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.9325279Z new_callables = [] 2025-12-04T09:41:42.9325396Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.9325506Z new_callables.append(fn(c)) 2025-12-04T09:41:42.9325613Z self.partitions = new_callables 2025-12-04T09:41:42.9325617Z 2025-12-04T09:41:42.9325712Z def call(self, args): 2025-12-04T09:41:42.9325800Z arg0_1, arg1_1 = args 
2025-12-04T09:41:42.9325883Z args.clear() 2025-12-04T09:41:42.9326015Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9326142Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9326247Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.9326352Z torch.cuda.set_device(0) 2025-12-04T09:41:42.9326522Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9326744Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.9326929Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9327123Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.9327212Z del arg0_1 2025-12-04T09:41:42.9327376Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9327635Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.9327739Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9327956Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.9328038Z del arg1_1 2025-12-04T09:41:42.9328123Z del buf0 2025-12-04T09:41:42.9328212Z return (buf1, ) 2025-12-04T09:41:42.9328216Z 2025-12-04T09:41:42.9328324Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.9328409Z call = runner.call 2025-12-04T09:41:42.9328567Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.9328571Z 2025-12-04T09:41:42.9328579Z 2025-12-04T09:41:42.9328722Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.9328853Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.9329000Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.9329204Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.9329405Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.9329513Z fn = lambda: call([arg0_1, 
arg1_1]) 2025-12-04T09:41:42.9329678Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.9329682Z 2025-12-04T09:41:42.9329686Z 2025-12-04T09:41:42.9329858Z if __name__ == "__main__": 2025-12-04T09:41:42.9330066Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.9330230Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.9330322Z From CHECK: .to( 2025-12-04T09:41:42.9330331Z 2025-12-04T09:41:42.9330334Z 2025-12-04T09:41:42.9330509Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.9331058Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9331063Z 2025-12-04T09:41:42.9331286Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.9331462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9331563Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9331697Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9331950Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9333345Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9333430Z graph_break [] 2025-12-04T09:41:42.9333541Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9333714Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:42.9333807Z Autotune Choices Stats: 2025-12-04T09:41:42.9334653Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9334751Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9334923Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9335033Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9335519Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9335989Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9336459Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9336928Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9337388Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9337860Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9338328Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9338786Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9339377Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9339848Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9340189Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.9340281Z Autotune Choices Stats: 2025-12-04T09:41:42.9341115Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9341216Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9341308Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9341414Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9341905Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9342363Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9342826Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:42.9343283Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9343746Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9344286Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9344745Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9345218Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9345687Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9346158Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9346493Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.9346680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9346776Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9346909Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9347165Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9348153Z 
inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9348237Z graph_break [] 2025-12-04T09:41:42.9348436Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9348611Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9348712Z Autotune Choices Stats: 2025-12-04T09:41:42.9349547Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9349641Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9349737Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9349842Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9350322Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9350797Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9351277Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9351760Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9352237Z triton_mm_804 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9352726Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9353194Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9353747Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9354219Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9354684Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9355020Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.9355112Z Autotune Choices Stats: 2025-12-04T09:41:42.9355967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9356061Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9356147Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9356260Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9356752Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9357249Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9357722Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9358269Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9358735Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9359197Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9359773Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9360243Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9360728Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9361197Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.9361529Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.9361633Z Autotune Choices Stats: 2025-12-04T09:41:42.9362468Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.9362577Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9362663Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9362774Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9363339Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9363802Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9364272Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9364739Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9365218Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9365698Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9366165Z 
triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9366643Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9367109Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9367657Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9368035Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.9368208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9368310Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9368441Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9368697Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9369637Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9369726Z graph_break [] 2025-12-04T09:41:42.9369837Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9370016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9370116Z Autotune Choices Stats: 2025-12-04T09:41:42.9370950Z {"num_choices": 15, "num_triton_choices": 15, 
"best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.9371044Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9371136Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9371243Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9371718Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9372195Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9372736Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9373207Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9373680Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9374156Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9374632Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9375104Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9375578Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9376041Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9376381Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.9376579Z Autotune Choices Stats: 2025-12-04T09:41:42.9377440Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.9377543Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9377648Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9377758Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9378222Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9378692Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9379168Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9379633Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9380101Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9380564Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9381033Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9381510Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9382063Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9382543Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9382871Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.9382972Z Autotune Choices Stats: 2025-12-04T09:41:42.9383808Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9383914Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9384007Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:42.9384120Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9384602Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9385075Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9385560Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9386097Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9386565Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9387031Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9387495Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9387963Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9388427Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9388905Z triton_mm_942 0.0276 
ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9389232Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.9389406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9389505Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9389637Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9389883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9391407Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9391499Z graph_break [] 2025-12-04T09:41:42.9391611Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9391787Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9391879Z Autotune Choices Stats: 2025-12-04T09:41:42.9392724Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9392822Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9392919Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9393023Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9393503Z triton_mm_979 0.0266 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9393982Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9394448Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9394917Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9395458Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9395937Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9396407Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9396876Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9397352Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9397854Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9398209Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.9398303Z Autotune Choices Stats: 2025-12-04T09:41:42.9399143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9399237Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9399329Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9399438Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9399987Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9400930Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9401417Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9401880Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9402348Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9402812Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:42.9403297Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9403770Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9404244Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9404723Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9405175Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.9405358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9405452Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9405590Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9405845Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9406786Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9406874Z graph_break [] 2025-12-04T09:41:42.9406978Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9407151Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9407254Z Autotune Choices Stats: 2025-12-04T09:41:42.9408143Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.9408249Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9408335Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9408440Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9408927Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9409402Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9409883Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9410459Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9410932Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9411408Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9411875Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9412364Z triton_mm_1017 
0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9412841Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9413330Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9413664Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.9413758Z Autotune Choices Stats: 2025-12-04T09:41:42.9414594Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9414768Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9414865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9414973Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9415452Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9415933Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9416405Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9416885Z triton_mm_1044 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9417361Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9417834Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9418300Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9418773Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9419258Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9419810Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9420153Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.9420250Z Autotune Choices Stats: 2025-12-04T09:41:42.9421109Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9421215Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9421304Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9421429Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9421916Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9422397Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9422870Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9423339Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9423819Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9424369Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9424846Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9425320Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9425794Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:42.9426271Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9426608Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.9426794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9426890Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9427024Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9427284Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9428274Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9428367Z graph_break [] 2025-12-04T09:41:42.9428480Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9428663Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9428763Z Autotune Choices Stats: 2025-12-04T09:41:42.9429667Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9429770Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9429859Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9429966Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9430455Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9430931Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9431432Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9431913Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9432395Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9432885Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9433477Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9433971Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9434442Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9434921Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:42.9435254Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.9435350Z Autotune Choices Stats: 2025-12-04T09:41:42.9436200Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.9436296Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9436395Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9436502Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9436983Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9437515Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9437989Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9438553Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9439028Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9439560Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9440041Z 
triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9440519Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9441013Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9441487Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9441827Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.9441923Z Autotune Choices Stats: 2025-12-04T09:41:42.9442759Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9442942Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9443031Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9443152Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9443635Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9444114Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9444597Z triton_mm_1156 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9445079Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9445575Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9446060Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9446547Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9447076Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9447552Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9448136Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9448470Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.9448654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9448751Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9448887Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T09:41:42.9449143Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9450520Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9450621Z graph_break [] 2025-12-04T09:41:42.9450728Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9450905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9451009Z Autotune Choices Stats: 2025-12-04T09:41:42.9451840Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9451943Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9452037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9452224Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9452707Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9453191Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9453673Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9454154Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9454634Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9455135Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9455617Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9456109Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9456589Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9457075Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9457441Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.9457635Z Autotune Choices Stats: 2025-12-04T09:41:42.9458489Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9458585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9458680Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9458789Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9459274Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9459765Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9460240Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9460722Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9461191Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9461663Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9462216Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9462697Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9463185Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9463661Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9464005Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.9464185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9464282Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9464422Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9464674Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9466057Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9466142Z graph_break [] 2025-12-04T09:41:42.9466251Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9466438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9466532Z Autotune Choices Stats: 2025-12-04T09:41:42.9467510Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9467610Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9467698Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9467810Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9468297Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9468780Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9469261Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9469734Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9470210Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9470678Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9471164Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9471713Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9472186Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:42.9472666Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9473004Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.9473104Z Autotune Choices Stats: 2025-12-04T09:41:42.9473948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9474056Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9474148Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9474253Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9474744Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9475219Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9475696Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9476176Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9476723Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:42.9477200Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9477670Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9478155Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9478635Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9479128Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9487602Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.9487781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9487884Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9488016Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9488262Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9489662Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9489862Z graph_break [] 2025-12-04T09:41:42.9489976Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9490153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9490245Z Autotune Choices Stats: 2025-12-04T09:41:42.9491093Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:42.9491196Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9491289Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9491395Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9491884Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9492368Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9492845Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9493326Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9493807Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9494374Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9494856Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9495328Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9495805Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9496278Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9496621Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:42.9496714Z Autotune Choices Stats: 2025-12-04T09:41:42.9497569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9497671Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9497757Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9497868Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9498350Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9498902Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9499384Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9499855Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9501012Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9501489Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9501968Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9502443Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9502915Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9503404Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9503741Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:42.9503921Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:42.9504190Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9504324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9504579Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9505959Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9506051Z graph_break [] 2025-12-04T09:41:42.9506156Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9506329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9506429Z Autotune Choices Stats: 2025-12-04T09:41:42.9507267Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9507364Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9507450Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9507555Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9508046Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9508637Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9509127Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9509608Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9510095Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9510584Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9511076Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9511563Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9512037Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9512512Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9512840Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:42.9512937Z Autotune Choices Stats: 2025-12-04T09:41:42.9513874Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:42.9513971Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9514067Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9514171Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9514654Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9515126Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9515593Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9516082Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9516554Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9517036Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9517554Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9518116Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9518608Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9519081Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9519416Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:42.9519654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9519752Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9519896Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9520146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9521527Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9521612Z graph_break [] 2025-12-04T09:41:42.9521719Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9521899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9521990Z Autotune Choices Stats: 2025-12-04T09:41:42.9522845Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9522944Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9523035Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9523230Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9523717Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9524197Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9524665Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9525140Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9525638Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9526106Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9526590Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9527062Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9527621Z 
triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9528103Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9528436Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:42.9528542Z Autotune Choices Stats: 2025-12-04T09:41:42.9529389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9529499Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9529595Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9529701Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9530199Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9530675Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9531156Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9531624Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9532106Z triton_mm_1385 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9532665Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9533138Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9533617Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9534099Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9534581Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9534912Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:42.9535140Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.9535251Z Traceback (most recent call last): 2025-12-04T09:41:42.9535665Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9535854Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.9536206Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.9536387Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 
2025-12-04T09:41:42.9536564Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.9536727Z Searched string: 2025-12-04T09:41:42.9536862Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.9536877Z 2025-12-04T09:41:42.9536998Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.9537003Z 2025-12-04T09:41:42.9537139Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9537278Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9537283Z 2025-12-04T09:41:42.9537380Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.9537470Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.9537569Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9537660Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.9537664Z 2025-12-04T09:41:42.9537757Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.9537846Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.9537936Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9538029Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.9538038Z 2025-12-04T09:41:42.9538042Z 2025-12-04T09:41:42.9538206Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.9538210Z 2025-12-04T09:41:42.9538214Z 2025-12-04T09:41:42.9538337Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.9538466Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.9538583Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.9538673Z idx_m = rm[:, None] 2025-12-04T09:41:42.9538757Z idx_n = rn[None, :] 2025-12-04T09:41:42.9538854Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.9538859Z 2025-12-04T09:41:42.9538960Z # inductor generates a suffix 2025-12-04T09:41:42.9539053Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9539265Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.9539357Z ''', device_str='cuda') 2025-12-04T09:41:42.9539362Z 2025-12-04T09:41:42.9539365Z 2025-12-04T09:41:42.9539462Z 
async_compile.wait(globals()) 2025-12-04T09:41:42.9539554Z del async_compile 2025-12-04T09:41:42.9539558Z 2025-12-04T09:41:42.9539637Z class Runner: 2025-12-04T09:41:42.9539737Z def __init__(self, partitions): 2025-12-04T09:41:42.9539852Z self.partitions = partitions 2025-12-04T09:41:42.9539936Z 2025-12-04T09:41:42.9540054Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.9540143Z new_callables = [] 2025-12-04T09:41:42.9540264Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.9540368Z new_callables.append(fn(c)) 2025-12-04T09:41:42.9540475Z self.partitions = new_callables 2025-12-04T09:41:42.9540479Z 2025-12-04T09:41:42.9540567Z def call(self, args): 2025-12-04T09:41:42.9540655Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.9540748Z args.clear() 2025-12-04T09:41:42.9540877Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9541000Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9541122Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.9541218Z torch.cuda.set_device(0) 2025-12-04T09:41:42.9541388Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9541616Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.9541718Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9541916Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.9541997Z del arg0_1 2025-12-04T09:41:42.9542162Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9542430Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.9542530Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9542752Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:42.9542953Z del arg1_1 2025-12-04T09:41:42.9543031Z del buf0 2025-12-04T09:41:42.9543119Z return (buf1, ) 2025-12-04T09:41:42.9543123Z 2025-12-04T09:41:42.9543225Z runner = 
Runner(partitions=[]) 2025-12-04T09:41:42.9543310Z call = runner.call 2025-12-04T09:41:42.9543486Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.9543490Z 2025-12-04T09:41:42.9543494Z 2025-12-04T09:41:42.9543632Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.9543769Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.9543925Z from torch._inductor.utils import print_performance 2025-12-04T09:41:42.9544124Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.9544332Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.9544430Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.9544592Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.9544601Z 2025-12-04T09:41:42.9544605Z 2025-12-04T09:41:42.9544702Z if __name__ == "__main__": 2025-12-04T09:41:42.9544903Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.9545076Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.9545162Z From CHECK: .to( 2025-12-04T09:41:42.9545166Z 2025-12-04T09:41:42.9545170Z 2025-12-04T09:41:42.9545346Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.9545905Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9545910Z 2025-12-04T09:41:42.9546128Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.9546313Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9546406Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9546539Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9546795Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9548308Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9548398Z graph_break [] 2025-12-04T09:41:42.9548507Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9548684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9548779Z Autotune Choices Stats: 2025-12-04T09:41:42.9549621Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9549732Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9549822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9549926Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9550430Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9550898Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9551378Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9551926Z triton_mm_17 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9552390Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9552871Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9553327Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9553791Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9554266Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9554740Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9555080Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.9555175Z Autotune Choices Stats: 2025-12-04T09:41:42.9556012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9556109Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9556196Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9556308Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9556862Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9557340Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9557796Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9558264Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9558732Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9559203Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9559717Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9560190Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9560659Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.9561203Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9561541Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.9561715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9561813Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9561948Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9562196Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9563136Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9563230Z graph_break [] 2025-12-04T09:41:42.9563333Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9563511Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9563605Z Autotune Choices Stats: 2025-12-04T09:41:42.9564432Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9564532Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9564618Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9564727Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9565202Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9565677Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9566232Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9566708Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9567187Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9567674Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9568149Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9568631Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9569100Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9569570Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9569900Z SingleProcess AUTOTUNE benchmarking 
takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.9570073Z Autotune Choices Stats: 2025-12-04T09:41:42.9570904Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9570996Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9571082Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9571187Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9571658Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9572121Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9572600Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9573079Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9573541Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9574009Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9574470Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9574944Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9575489Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9575967Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9576305Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.9576396Z Autotune Choices Stats: 2025-12-04T09:41:42.9577233Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.9577331Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9577418Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9577535Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9578006Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9578471Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9578938Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9579409Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9579990Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9580458Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9580928Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9581398Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9581865Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9582345Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9582669Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.9582845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9582938Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9583069Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9583312Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9584252Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9584339Z graph_break [] 2025-12-04T09:41:42.9584520Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9584694Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9584787Z Autotune Choices Stats: 2025-12-04T09:41:42.9585618Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.9585713Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9585797Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9585900Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9586374Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9586849Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9587312Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9587772Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:42.9588245Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9588791Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9589260Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9589732Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9590193Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9590667Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9591003Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.9591094Z Autotune Choices Stats: 2025-12-04T09:41:42.9591926Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.9592018Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9592105Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9592208Z dtypes: torch.float32, torch.float32 
2025-12-04T09:41:42.9592678Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9593147Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9593694Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9594158Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9594618Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9595086Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9595553Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9596028Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9596498Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9596966Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9597298Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.9597388Z Autotune Choices Stats: 2025-12-04T09:41:42.9598224Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9598397Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9598482Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9598597Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9599071Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9599654Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9600137Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9600799Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9601268Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9601728Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9602191Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9602649Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9603235Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9603715Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9604044Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.9604220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9604313Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9604444Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9604689Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9606076Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9606162Z graph_break [] 2025-12-04T09:41:42.9606263Z aten_mm_info [('aten.mm_256_256_256', 
1)] 2025-12-04T09:41:42.9606434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9606527Z Autotune Choices Stats: 2025-12-04T09:41:42.9607365Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9607563Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9607648Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9607756Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9608247Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9608709Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9609170Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9609628Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9610095Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9610565Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9611031Z triton_mm_973 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9611504Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9611971Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9612596Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9612927Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.9613017Z Autotune Choices Stats: 2025-12-04T09:41:42.9613850Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9613940Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9614027Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9614129Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9614605Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9615086Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9615564Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9616036Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9616504Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9617055Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9617528Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9618004Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9618483Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9618958Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9619302Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.9619482Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9619578Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9619717Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:42.9619970Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9620916Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9621000Z graph_break [] 2025-12-04T09:41:42.9621104Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9621288Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9621381Z Autotune Choices Stats: 2025-12-04T09:41:42.9622311Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.9622408Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9622497Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9622612Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9623095Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9623578Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9624057Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9624537Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9625013Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9625483Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9625955Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9626519Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9627013Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9627485Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9627816Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.9627915Z Autotune Choices Stats: 2025-12-04T09:41:42.9628739Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9628844Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9628936Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:42.9629043Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9629524Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9629997Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9630481Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9630953Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9631497Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9631974Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9632442Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9632922Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9633400Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9633885Z triton_mm_1047 
0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9634213Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.9634308Z Autotune Choices Stats: 2025-12-04T09:41:42.9635347Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9635541Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9635634Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9635745Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9636233Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9636718Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9637184Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9637660Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9638143Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9638614Z triton_mm_1066 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9639095Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9639626Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9640108Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9640590Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9641011Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.9641187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9641284Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9641422Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9641668Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9642619Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9642712Z graph_break [] 2025-12-04T09:41:42.9642816Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9642999Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:42.9643101Z Autotune Choices Stats: 2025-12-04T09:41:42.9643939Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9644046Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9644133Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9644254Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9644742Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9645296Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9645786Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9646269Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9646756Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9647235Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9647719Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9648206Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9648677Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9649166Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9649499Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.9649606Z Autotune Choices Stats: 2025-12-04T09:41:42.9650543Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.9650653Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9650744Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9650852Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9651333Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9651812Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9652297Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9652781Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9653253Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9653728Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9654192Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9654753Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9655236Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9655729Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9656060Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.9656159Z Autotune Choices Stats: 2025-12-04T09:41:42.9656997Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9657096Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9657201Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9657320Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9657797Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9658274Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9658751Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9659236Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9659800Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9660281Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9660776Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9661244Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9661731Z triton_mm_1151 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9662202Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9662538Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:42.9662713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9662810Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9662953Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9663199Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9664578Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9664744Z graph_break [] 2025-12-04T09:41:42.9664851Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9665030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9665123Z Autotune Choices Stats: 2025-12-04T09:41:42.9665957Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9666060Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:42.9666149Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9666262Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9666746Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9667218Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9667699Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9668177Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9668662Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9669220Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9669707Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9670191Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9670675Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9671161Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9671496Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.9671599Z Autotune Choices Stats: 2025-12-04T09:41:42.9672437Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9672538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9672625Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9672729Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9673213Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9673780Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9674255Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9674722Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9675189Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:42.9675664Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9676140Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9676623Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9677149Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9677631Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9677962Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.9678135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9678236Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9678446Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9678694Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9680145Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9680229Z graph_break [] 2025-12-04T09:41:42.9680350Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9680526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9680618Z Autotune Choices Stats: 2025-12-04T09:41:42.9681476Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9681571Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9681667Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9681774Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9682262Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9682743Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9683324Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9683801Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9684269Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9684743Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9685217Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9685697Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9686176Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9686650Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9686991Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.9687085Z Autotune Choices Stats: 2025-12-04T09:41:42.9687997Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9688101Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9688193Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9688305Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9688789Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9689265Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9689745Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9690226Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9690700Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9691168Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9691643Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9692121Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9692679Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9693160Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9693493Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.9693671Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:42.9693766Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9693900Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9694153Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9695545Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9695639Z graph_break [] 2025-12-04T09:41:42.9695744Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9695919Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9696017Z Autotune Choices Stats: 2025-12-04T09:41:42.9696856Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:42.9696961Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9697048Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9697153Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9698051Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9698536Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.9699019Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9699494Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9699986Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9700695Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9701176Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9701648Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9702118Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9702726Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9703063Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:42.9703157Z Autotune Choices Stats: 2025-12-04T09:41:42.9704003Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9704097Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9704194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9704306Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9704781Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9705269Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9705744Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9706231Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9706703Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9707190Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9707806Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9708286Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9708769Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9709247Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9709589Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:42.9709768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9709865Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9710004Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9710253Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9711660Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9711821Z graph_break [] 2025-12-04T09:41:42.9711926Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9712116Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9712210Z Autotune Choices Stats: 2025-12-04T09:41:42.9713051Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9713150Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9713240Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9713354Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9713830Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9714323Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9714804Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9715286Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9715777Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9716260Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9716840Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9717322Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.9717856Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9718327Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9718661Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:42.9718770Z Autotune Choices Stats: 2025-12-04T09:41:42.9719656Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:42.9719761Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9719851Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9719955Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9720436Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9720904Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9721485Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9721967Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9722444Z 
triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9722914Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9723388Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9723875Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9724355Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9724837Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9725170Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:42.9725346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9725453Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9725591Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9725850Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9727361Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9727460Z graph_break [] 2025-12-04T09:41:42.9727566Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9727741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9727843Z Autotune Choices Stats: 2025-12-04T09:41:42.9728695Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9728801Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9728898Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9729005Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9729503Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9729977Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9730450Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9731002Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9731478Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9731957Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9732432Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9732922Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9733405Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9733886Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9734236Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:42.9734335Z Autotune Choices Stats: 2025-12-04T09:41:42.9735190Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9735297Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9735386Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9735496Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9736057Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9736539Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9737017Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9737484Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9737962Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9738432Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9738911Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9739387Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9739871Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9740424Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9740762Z SingleProcess AUTOTUNE 
benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:42.9740944Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9741040Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9741179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9741429Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9742375Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9742471Z graph_break [] 2025-12-04T09:41:42.9742574Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9742754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9742850Z Autotune Choices Stats: 2025-12-04T09:41:42.9743700Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.9743803Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9743890Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9743997Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9744480Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9744956Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9745512Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9745985Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9746468Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9746943Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9747422Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9747915Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9748393Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9748878Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9753136Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:42.9753355Z Autotune Choices Stats: 2025-12-04T09:41:42.9754362Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9754460Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9754554Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9754666Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9755233Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9755795Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9756352Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9756907Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9757457Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9758008Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9758556Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9759116Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9759894Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9760390Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9760721Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:42.9760813Z Autotune Choices Stats: 2025-12-04T09:41:42.9761660Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9761764Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9761852Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9761970Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9762453Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9762935Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9763406Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9763968Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9764458Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9764940Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9765419Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9765895Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9766388Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9766895Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9767251Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:42.9767433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9767529Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9767664Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9767912Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9768850Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9768940Z graph_break [] 2025-12-04T09:41:42.9769172Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9769350Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9769449Z Autotune Choices Stats: 2025-12-04T09:41:42.9770296Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9770391Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9770477Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9770582Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9771067Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9771554Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9772033Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9772512Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9772989Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:42.9773535Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9774008Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9774477Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9774944Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9775415Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9775755Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:42.9775848Z Autotune Choices Stats: 2025-12-04T09:41:42.9776683Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9776777Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9776868Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9776973Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9777442Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9777922Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9778468Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9778940Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9779408Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9779885Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9780355Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9780832Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9781306Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9781777Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9782110Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds 
precompiling for 13 choices 2025-12-04T09:41:42.9782283Z Autotune Choices Stats: 2025-12-04T09:41:42.9783127Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9783225Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9783311Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9783425Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9783900Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9784372Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9784845Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9785333Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9785813Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9786291Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9786779Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9787258Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9787849Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9788321Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9788648Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:42.9788826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9788922Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9789053Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9789306Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9790241Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9790328Z graph_break [] 2025-12-04T09:41:42.9790432Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9790606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9790702Z Autotune Choices Stats: 2025-12-04T09:41:42.9791544Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9791724Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9791808Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9791914Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9792399Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9792876Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9793352Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9793830Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9794320Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9794802Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9795270Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9795744Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9796209Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9796687Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9797168Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:42.9797263Z Autotune Choices Stats: 2025-12-04T09:41:42.9798108Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:42.9798201Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9798289Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9798394Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9798865Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9799346Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9799908Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9800609Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:42.9801077Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9801673Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9802150Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9802632Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9803108Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9803579Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9803919Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:42.9804013Z Autotune Choices Stats: 2025-12-04T09:41:42.9804853Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9804951Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9805036Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9805151Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:42.9805626Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9806096Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9806682Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9807150Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9807620Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9808086Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9808564Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9809044Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9809517Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9809999Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9810328Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:42.9810507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9810682Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9810815Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9811068Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9812447Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9812534Z graph_break [] 2025-12-04T09:41:42.9812639Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9812812Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9812907Z Autotune Choices Stats: 2025-12-04T09:41:42.9813744Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9813840Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9813925Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9814031Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9814508Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9814980Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9815462Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9816015Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9816486Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9816954Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9817427Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9817902Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9818377Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9818852Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:42.9819182Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:42.9819274Z Autotune Choices Stats: 2025-12-04T09:41:42.9820112Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9820280Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9820368Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9820472Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9820955Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9821428Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9821892Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9822360Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9822835Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9823300Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9823767Z 
triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9824240Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9824716Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9825266Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9825599Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:42.9825821Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:42.9825923Z Traceback (most recent call last): 2025-12-04T09:41:42.9826340Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9826521Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:42.9826871Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:42.9827057Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:42.9827219Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:42.9827307Z Searched string: 2025-12-04T09:41:42.9827443Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:42.9827448Z 2025-12-04T09:41:42.9827564Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:42.9827569Z 2025-12-04T09:41:42.9827698Z a_k_idx_vals = offs_k[None, :] + 
(k_idx * BLOCK_K) 2025-12-04T09:41:42.9827823Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:42.9827827Z 2025-12-04T09:41:42.9827922Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:42.9828011Z idx_n = a_k_idx_vals 2025-12-04T09:41:42.9828102Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9828196Z a = tl.load(A + (xindex)) 2025-12-04T09:41:42.9828200Z 2025-12-04T09:41:42.9828290Z idx_m = b_k_idx_vals 2025-12-04T09:41:42.9828488Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:42.9828580Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9828669Z b = tl.load(B + (xindex)) 2025-12-04T09:41:42.9828674Z 2025-12-04T09:41:42.9828678Z 2025-12-04T09:41:42.9828845Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:42.9828850Z 2025-12-04T09:41:42.9828853Z 2025-12-04T09:41:42.9828974Z # rematerialize rm and rn to save registers 2025-12-04T09:41:42.9829089Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:42.9829203Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:42.9829289Z idx_m = rm[:, None] 2025-12-04T09:41:42.9829375Z idx_n = rn[None, :] 2025-12-04T09:41:42.9829468Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:42.9829473Z 2025-12-04T09:41:42.9829570Z # inductor generates a suffix 2025-12-04T09:41:42.9829662Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:42.9829874Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:42.9829965Z ''', device_str='cuda') 2025-12-04T09:41:42.9829969Z 2025-12-04T09:41:42.9829973Z 2025-12-04T09:41:42.9830077Z async_compile.wait(globals()) 2025-12-04T09:41:42.9830161Z del async_compile 2025-12-04T09:41:42.9830165Z 2025-12-04T09:41:42.9830252Z class Runner: 2025-12-04T09:41:42.9830357Z def __init__(self, partitions): 2025-12-04T09:41:42.9830458Z self.partitions = partitions 2025-12-04T09:41:42.9830462Z 2025-12-04T09:41:42.9830572Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:42.9830661Z new_callables = [] 
2025-12-04T09:41:42.9830776Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:42.9830883Z new_callables.append(fn(c)) 2025-12-04T09:41:42.9830985Z self.partitions = new_callables 2025-12-04T09:41:42.9830990Z 2025-12-04T09:41:42.9831082Z def call(self, args): 2025-12-04T09:41:42.9831170Z arg0_1, arg1_1 = args 2025-12-04T09:41:42.9831252Z args.clear() 2025-12-04T09:41:42.9831389Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9831512Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:42.9831615Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:42.9831796Z torch.cuda.set_device(0) 2025-12-04T09:41:42.9831963Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9832182Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:42.9832283Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9832472Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:42.9832555Z del arg0_1 2025-12-04T09:41:42.9832716Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:42.9832970Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:42.9833076Z stream0 = get_raw_stream(0) 2025-12-04T09:41:42.9833293Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:42.9833373Z del arg1_1 2025-12-04T09:41:42.9833453Z del buf0 2025-12-04T09:41:42.9833542Z return (buf1, ) 2025-12-04T09:41:42.9833546Z 2025-12-04T09:41:42.9833649Z runner = Runner(partitions=[]) 2025-12-04T09:41:42.9833735Z call = runner.call 2025-12-04T09:41:42.9833891Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:42.9833896Z 2025-12-04T09:41:42.9833899Z 2025-12-04T09:41:42.9834039Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:42.9834168Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:42.9834315Z from 
torch._inductor.utils import print_performance 2025-12-04T09:41:42.9834517Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:42.9834716Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:42.9834894Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:42.9835054Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:42.9835059Z 2025-12-04T09:41:42.9835063Z 2025-12-04T09:41:42.9835155Z if __name__ == "__main__": 2025-12-04T09:41:42.9835356Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:42.9835514Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:42.9835599Z From CHECK: .to( 2025-12-04T09:41:42.9835603Z 2025-12-04T09:41:42.9835609Z 2025-12-04T09:41:42.9835784Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:42.9836338Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:42.9836343Z 2025-12-04T09:41:42.9836562Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:42.9836742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9836840Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9836988Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9837276Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9838658Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9838744Z graph_break [] 2025-12-04T09:41:42.9838849Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9839022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9839117Z Autotune Choices Stats: 2025-12-04T09:41:42.9840093Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9840191Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9840277Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9840386Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9840868Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9841330Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9841799Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9842260Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9842718Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9843191Z triton_mm_29 0.0277 ms 
96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9843651Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9844181Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9844643Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9845107Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9845441Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:42.9845534Z Autotune Choices Stats: 2025-12-04T09:41:42.9846361Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9846462Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9846551Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9846653Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9847132Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9847589Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9848045Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9848504Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9849036Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9849492Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9849948Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9850414Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9850882Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9851352Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9851684Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:42.9851857Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:42.9851953Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9852085Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9852332Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9853272Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9853433Z graph_break [] 2025-12-04T09:41:42.9853543Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9853717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9853813Z Autotune Choices Stats: 2025-12-04T09:41:42.9854636Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9854730Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9854818Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9854923Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9855400Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9855875Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9856347Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9856926Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9857517Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9858127Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9858799Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9859328Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9859799Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9860260Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9860599Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:42.9860690Z Autotune Choices Stats: 2025-12-04T09:41:42.9861523Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9861616Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9861702Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9861809Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9862276Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9862740Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9863316Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9863779Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9864242Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9864706Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9865175Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9865651Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9866119Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9866670Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9867084Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:42.9867203Z Autotune Choices Stats: 2025-12-04T09:41:42.9868342Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:42.9868462Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9868560Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9868669Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9869146Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9869607Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9870069Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9870542Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9871014Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9871487Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9871952Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9872420Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9872966Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9873428Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9873758Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:42.9873932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9874028Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9874158Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9874402Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9875350Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9875433Z graph_break [] 
2025-12-04T09:41:42.9875551Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9875729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9875823Z Autotune Choices Stats: 2025-12-04T09:41:42.9876663Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:42.9876763Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9876855Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9876965Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9877523Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9877999Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9878459Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9878929Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9879405Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9880022Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:42.9880491Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9880966Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9881434Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9881976Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9882321Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:42.9882416Z Autotune Choices Stats: 2025-12-04T09:41:42.9883246Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:42.9883348Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9883437Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9883549Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9884016Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9884492Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9884962Z 
triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9885425Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9885894Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9886356Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9886909Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9887379Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9887848Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9888327Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9888660Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:42.9888765Z Autotune Choices Stats: 2025-12-04T09:41:42.9889599Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9889698Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9889786Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9889897Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9890381Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9890854Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9891430Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9891905Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9892367Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9892835Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9893295Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9893768Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:42.9894245Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9894715Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9895050Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:42.9895226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9895328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9895467Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9895713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9897259Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9897344Z graph_break [] 2025-12-04T09:41:42.9897454Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9897628Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9897723Z Autotune Choices Stats: 2025-12-04T09:41:42.9898566Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9898672Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9898763Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9898871Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9899352Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9899820Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9900692Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9901285Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9901755Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9902233Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9902705Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9903177Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9903654Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9904120Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9904459Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:42.9904551Z Autotune Choices Stats: 2025-12-04T09:41:42.9905381Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9905492Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9905581Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9905689Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9906275Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9906748Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9907232Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9907697Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9908170Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9908636Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9909108Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9909584Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9910061Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9910663Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9911002Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:42.9911189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9911291Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9911426Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9911685Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9912629Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9912721Z graph_break [] 
2025-12-04T09:41:42.9912828Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9913004Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9913107Z Autotune Choices Stats: 2025-12-04T09:41:42.9913946Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:42.9914047Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9914136Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9914243Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9914731Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9915210Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9915772Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9916242Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9916712Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9917183Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:42.9917706Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9918191Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9918666Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9919147Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9919535Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:42.9919707Z Autotune Choices Stats: 2025-12-04T09:41:42.9920569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9920666Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9920757Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9920866Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9921347Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9921833Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:42.9922312Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9922793Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9923260Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9923726Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9924201Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9924679Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9925237Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9925714Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9926046Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:42.9926141Z Autotune Choices Stats: 2025-12-04T09:41:42.9926995Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9927100Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9927188Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9927308Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9927838Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9928313Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9928788Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9929343Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9929827Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9930295Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9930765Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9931245Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9931725Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9932211Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9932538Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:42.9932721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9932821Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9932956Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9933205Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9934148Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9934242Z graph_break [] 2025-12-04T09:41:42.9934455Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9934634Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9934738Z Autotune Choices Stats: 2025-12-04T09:41:42.9935576Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9935671Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:42.9935768Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9935871Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9936358Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9936864Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9937368Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9937851Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9938329Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9938892Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9939375Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9939861Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9940329Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:42.9940804Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9941144Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:42.9941238Z Autotune Choices Stats: 2025-12-04T09:41:42.9942073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:42.9942167Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9942254Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9942369Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9942845Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9943328Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9943879Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9944359Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9944835Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:42.9945309Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9945791Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9946270Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9946774Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9947271Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9947611Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:42.9947782Z Autotune Choices Stats: 2025-12-04T09:41:42.9948636Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9948736Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9948823Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9948936Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:42.9949420Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9949891Z 
triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9950370Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9950861Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9951347Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9951828Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9952314Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9952792Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9953348Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9953823Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9954159Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 
2025-12-04T09:41:42.9954347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9954445Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9954578Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9954841Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9956224Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9956313Z graph_break [] 2025-12-04T09:41:42.9956420Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9956597Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9956700Z Autotune Choices Stats: 2025-12-04T09:41:42.9957551Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9957750Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9957848Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9957959Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9958451Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9958927Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9959398Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9959940Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9960430Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9960917Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9961400Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9961891Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:42.9962378Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9962929Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9963268Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:42.9963361Z Autotune Choices Stats: 2025-12-04T09:41:42.9964202Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9964299Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9964392Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9964504Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9964989Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9965475Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9965947Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9966429Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9966896Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9967472Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9967945Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9968418Z triton_mm_1217 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9968898Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9969371Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9969717Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:42.9969891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9969986Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9970125Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9970373Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9971747Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9971841Z graph_break [] 2025-12-04T09:41:42.9971945Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9972205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9972300Z Autotune Choices Stats: 2025-12-04T09:41:42.9973159Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:42.9973257Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9973345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9973456Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9973937Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9974419Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9974894Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9975361Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9975832Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9976299Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9976865Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9977365Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9977858Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9978333Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9978676Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:42.9978777Z Autotune Choices Stats: 2025-12-04T09:41:42.9979616Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9979719Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9979813Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9979919Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9980400Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9980878Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9981436Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9981913Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:42.9982381Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9982849Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9983314Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9983799Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9984273Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9984743Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9985079Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:42.9985256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:42.9985443Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:42.9985575Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:42.9985822Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:42.9987215Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 
13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:42.9987297Z graph_break [] 2025-12-04T09:41:42.9987407Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:42.9987584Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:42.9987700Z Autotune Choices Stats: 2025-12-04T09:41:42.9988578Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:42.9988677Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9988766Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9988873Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:42.9989347Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9989828Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9990301Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9990860Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9991338Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9991822Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9992295Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:42.9992763Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9993242Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9993710Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9994049Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:42.9994144Z Autotune Choices Stats: 2025-12-04T09:41:42.9994981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:42.9995157Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:42.9995244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:42.9995354Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:42.9995841Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9996312Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9996787Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:42.9997259Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:42.9997783Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9998251Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:42.9998721Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:42.9999194Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:42.9999725Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0000584Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.0000927Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.0001107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0001205Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0001336Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0001588Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0002967Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0003065Z graph_break [] 2025-12-04T09:41:43.0003169Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0003343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0003440Z Autotune Choices Stats: 2025-12-04T09:41:43.0004270Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0004367Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0004454Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0004696Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0005177Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.0005655Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0006138Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0006619Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0007149Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0007635Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0008127Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0008605Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0009080Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0009550Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0009885Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.0010070Z Autotune Choices Stats: 2025-12-04T09:41:43.0010906Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0011001Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0011090Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0011197Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0011672Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0012152Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0012623Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0013097Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0013564Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0014036Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0014583Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0015059Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0015535Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0016010Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0019824Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.0020029Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0020126Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0020263Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0020515Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0021889Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0021975Z graph_break [] 2025-12-04T09:41:43.0022081Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0022258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0022356Z Autotune Choices Stats: 2025-12-04T09:41:43.0023311Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0023411Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0023498Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0023613Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0024103Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0024584Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0025057Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0025530Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0026000Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0026465Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0026940Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0027539Z 
triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0028029Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0028504Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0028840Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.0028936Z Autotune Choices Stats: 2025-12-04T09:41:43.0029774Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0029879Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0029970Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0030074Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0030563Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0031038Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0031515Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0031987Z triton_mm_1382 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0032530Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0033005Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0033475Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0033956Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0034433Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0034919Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0035249Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.0035423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0035521Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0035653Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0035904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0036866Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0037051Z graph_break [] 2025-12-04T09:41:43.0037163Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0037337Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0037434Z Autotune Choices Stats: 2025-12-04T09:41:43.0038268Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0038362Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0038450Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0038554Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0039036Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0039590Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0040059Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0040532Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0041007Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0041489Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0042071Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0042556Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0043038Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0043511Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0043853Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.0043948Z Autotune Choices Stats: 2025-12-04T09:41:43.0044794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0044889Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0044976Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0045084Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0045557Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0046113Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0046588Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0047106Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0047578Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0048045Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0048518Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0048994Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0049471Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0049946Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0050275Z SingleProcess 
AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.0050375Z Autotune Choices Stats: 2025-12-04T09:41:43.0051293Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0051395Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0051482Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0051592Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0052077Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0052552Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0053032Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0053515Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0053995Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0054473Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0054952Z triton_mm_1461 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0055517Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0056006Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0056487Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0056813Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.0056987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0057088Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0057223Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0057476Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0058424Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0058507Z graph_break [] 2025-12-04T09:41:43.0058612Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0058787Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0058878Z Autotune Choices Stats: 2025-12-04T09:41:43.0059712Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0059809Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0059899Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0060004Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0060559Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0061043Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0061523Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0062007Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0062480Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0062959Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0063427Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0063897Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0064368Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0064958Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0065297Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0065390Z Autotune Choices Stats: 2025-12-04T09:41:43.0066220Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0066317Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0066402Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0066509Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0066980Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0067464Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0067983Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0068453Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0068923Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0069394Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0069941Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0070415Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0070888Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0071365Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0071701Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.0071798Z Autotune Choices Stats: 2025-12-04T09:41:43.0072636Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0072732Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0072818Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.0072928Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0073407Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0073878Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0074439Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0074922Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0075398Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0075879Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0076362Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0076850Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0077337Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0077837Z 
triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0078169Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.0078343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0078447Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0078579Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0078928Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0079934Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0080019Z graph_break [] 2025-12-04T09:41:43.0080124Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0080298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0080389Z Autotune Choices Stats: 2025-12-04T09:41:43.0081226Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0081327Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0081419Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0081525Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0082004Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.0082490Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0082962Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0083529Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0084016Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0084490Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0084961Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0085432Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0085908Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0086379Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0086717Z SingleProcess AUTOTUNE benchmarking takes 0.2024 
seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0086812Z Autotune Choices Stats: 2025-12-04T09:41:43.0087638Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0087738Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0087825Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0087934Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0088487Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0088961Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0089436Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0089903Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0090378Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0090846Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0091317Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0091788Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0092261Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0092816Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0093150Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.0093246Z Autotune Choices Stats: 2025-12-04T09:41:43.0094080Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0094174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0094263Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0094376Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0094859Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0095340Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0095807Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0096278Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0096744Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0097215Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0097818Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0098296Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0098768Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0099246Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0099579Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.0099752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0099852Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0099988Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0100409Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0101802Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0101887Z graph_break [] 2025-12-04T09:41:43.0102125Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0102299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0102392Z Autotune Choices Stats: 2025-12-04T09:41:43.0103231Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0103326Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0103415Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0103521Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0103999Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0104477Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0104969Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0105445Z 
triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0105912Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0106378Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0106854Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0107430Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0107906Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0108379Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0108718Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.0108812Z Autotune Choices Stats: 2025-12-04T09:41:43.0109662Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0109769Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.0109855Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0109964Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0110443Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0110910Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0111385Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0111958Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0112430Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0112896Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0113366Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0113836Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0114317Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0114792Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0115123Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.0115299Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0115395Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0115525Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0115775Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0117294Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0117384Z graph_break [] 2025-12-04T09:41:43.0117489Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0117662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0117755Z Autotune Choices Stats: 2025-12-04T09:41:43.0118595Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0118697Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0118783Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0118888Z 
dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0119377Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0119898Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0120370Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0120847Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0121399Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0121876Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0122348Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0122822Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0123291Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0123774Z triton_mm_1706 0.0286 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0124106Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.0124199Z Autotune Choices Stats: 2025-12-04T09:41:43.0125031Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0125126Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0125213Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0125317Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0125793Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0126343Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0126811Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0127287Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0127755Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0128226Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0128697Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0129170Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0129645Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0130117Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0130525Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.0130752Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.0130856Z Traceback (most recent call last): 2025-12-04T09:41:43.0131275Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.0131457Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.0131807Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.0131988Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.0132150Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.0132237Z Searched string: 2025-12-04T09:41:43.0132376Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.0132382Z 
2025-12-04T09:41:43.0132498Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.0132503Z 2025-12-04T09:41:43.0132636Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.0132761Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.0132766Z 2025-12-04T09:41:43.0132861Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.0132951Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.0133044Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0133138Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.0133142Z 2025-12-04T09:41:43.0133229Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.0133318Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.0133413Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0133502Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.0133506Z 2025-12-04T09:41:43.0133510Z 2025-12-04T09:41:43.0133675Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.0133680Z 2025-12-04T09:41:43.0133683Z 2025-12-04T09:41:43.0133803Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.0133917Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.0134112Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.0134201Z idx_m = rm[:, None] 2025-12-04T09:41:43.0134286Z idx_n = rn[None, :] 2025-12-04T09:41:43.0134383Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.0134387Z 2025-12-04T09:41:43.0134483Z # inductor generates a suffix 2025-12-04T09:41:43.0134577Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0134787Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.0134874Z ''', device_str='cuda') 2025-12-04T09:41:43.0134879Z 2025-12-04T09:41:43.0134882Z 2025-12-04T09:41:43.0134982Z async_compile.wait(globals()) 2025-12-04T09:41:43.0135063Z del async_compile 2025-12-04T09:41:43.0135071Z 2025-12-04T09:41:43.0135150Z class Runner: 2025-12-04T09:41:43.0135251Z def __init__(self, partitions): 2025-12-04T09:41:43.0135354Z self.partitions = partitions 
2025-12-04T09:41:43.0135358Z 2025-12-04T09:41:43.0135471Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.0135561Z new_callables = [] 2025-12-04T09:41:43.0135676Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.0135782Z new_callables.append(fn(c)) 2025-12-04T09:41:43.0135886Z self.partitions = new_callables 2025-12-04T09:41:43.0135891Z 2025-12-04T09:41:43.0135978Z def call(self, args): 2025-12-04T09:41:43.0136068Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.0136150Z args.clear() 2025-12-04T09:41:43.0136301Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.0136476Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.0136632Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.0136863Z torch.cuda.set_device(0) 2025-12-04T09:41:43.0137030Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.0137248Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.0137358Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.0137578Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.0137677Z del arg0_1 2025-12-04T09:41:43.0137839Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.0138091Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.0138191Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.0138406Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.0138486Z del arg1_1 2025-12-04T09:41:43.0138567Z del buf0 2025-12-04T09:41:43.0138656Z return (buf1, ) 2025-12-04T09:41:43.0138661Z 2025-12-04T09:41:43.0138759Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.0138843Z call = runner.call 2025-12-04T09:41:43.0139001Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.0139009Z 2025-12-04T09:41:43.0139013Z 2025-12-04T09:41:43.0139154Z def 
benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.0139283Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.0139431Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.0139632Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.0139832Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.0139932Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.0140095Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.0140100Z 2025-12-04T09:41:43.0140108Z 2025-12-04T09:41:43.0140197Z if __name__ == "__main__": 2025-12-04T09:41:43.0140398Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.0140557Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.0140729Z From CHECK: .to( 2025-12-04T09:41:43.0140734Z 2025-12-04T09:41:43.0140741Z 2025-12-04T09:41:43.0140921Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.0141478Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.0141483Z 2025-12-04T09:41:43.0141704Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.0141880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0141973Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0142107Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0142362Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0143751Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), 
('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0143839Z graph_break [] 2025-12-04T09:41:43.0143942Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0144120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0144212Z Autotune Choices Stats: 2025-12-04T09:41:43.0145057Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0145256Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0145348Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0145458Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0145942Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0146405Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0146868Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0147330Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0147835Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0148310Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0148772Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0149229Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0149693Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0150233Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0150570Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.0150670Z Autotune Choices Stats: 2025-12-04T09:41:43.0151493Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0151590Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0151681Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0151786Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0152273Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0152734Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0153194Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0153650Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0154105Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0154646Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0155104Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0155572Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0156039Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0156507Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0156845Z SingleProcess AUTOTUNE benchmarking takes 
0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.0157023Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0157121Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0157253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0157502Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0158439Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0158529Z graph_break [] 2025-12-04T09:41:43.0158636Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0158809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0158898Z Autotune Choices Stats: 2025-12-04T09:41:43.0159930Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0160027Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0160121Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0160227Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0160699Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0161174Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.0161652Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0162130Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0162605Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0163089Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0163629Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0164101Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0164576Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0165038Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0165372Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.0165470Z Autotune Choices Stats: 2025-12-04T09:41:43.0166321Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0166417Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0166502Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0166607Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0167079Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0167591Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0168064Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0168612Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0169080Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0169541Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0170009Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0170477Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0170957Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0171427Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0171758Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.0171857Z Autotune Choices Stats: 2025-12-04T09:41:43.0172692Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.0172866Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0172953Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0173064Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0173543Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0174006Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0174468Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0174938Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.0175415Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0175891Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0176358Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0176831Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0177294Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0177916Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0178248Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.0178423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0178522Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0178657Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0178904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0179849Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0179938Z graph_break [] 2025-12-04T09:41:43.0180045Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0180222Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0180316Z Autotune Choices Stats: 2025-12-04T09:41:43.0181148Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0181244Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0181337Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0181443Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0181917Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0182467Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0182929Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0183393Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0183861Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.0184334Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0184809Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0185278Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0185742Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0186205Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0186539Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.0186637Z Autotune Choices Stats: 2025-12-04T09:41:43.0187540Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.0187638Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0187725Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0187836Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0188306Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0188773Z triton_mm_910 
0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0189247Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0189717Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0190188Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0190650Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0191115Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0191662Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0192135Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0192608Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0192938Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.0193034Z 
Autotune Choices Stats: 2025-12-04T09:41:43.0193865Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0193963Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0194057Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0194170Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0194645Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0195119Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0195596Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0196069Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0196613Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0197127Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0197591Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.0198053Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0198520Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0198991Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0199325Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.0199557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0199652Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0199785Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0200028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0201576Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0201785Z graph_break [] 2025-12-04T09:41:43.0201889Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0202061Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0202153Z Autotune Choices Stats: 2025-12-04T09:41:43.0203002Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0203098Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0203184Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0203287Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0203768Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0204230Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0204696Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0205156Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0205619Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0206243Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0206720Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0207212Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0207705Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0208170Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0208508Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.0208598Z Autotune Choices Stats: 2025-12-04T09:41:43.0209427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0209520Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0209604Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0209710Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0210183Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0210732Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0211207Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0211668Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0212131Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0212591Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0213067Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0213541Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0214013Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0214485Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0214821Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.0214995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0215088Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0215324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0215573Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0216512Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0216597Z graph_break [] 2025-12-04T09:41:43.0216699Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0216869Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0216966Z Autotune Choices Stats: 2025-12-04T09:41:43.0217806Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.0217901Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0217985Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0218087Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0218569Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0219043Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0219594Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0220067Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0220535Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.0221002Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0221469Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0221946Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0222419Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0222894Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0223222Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.0223311Z Autotune Choices Stats: 2025-12-04T09:41:43.0224144Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0224239Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0224322Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0224500Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0224975Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.0225455Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0225926Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0226398Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0226876Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0227366Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0227861Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0228332Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0228893Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0229369Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0229702Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.0229791Z Autotune Choices Stats: 2025-12-04T09:41:43.0230632Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0230727Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0230815Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0230924Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0231409Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0231885Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0232360Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0232825Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0233306Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0233853Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0234325Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0234799Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0235268Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0235746Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0236083Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.0236261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0236352Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0236482Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0236731Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0237703Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0237886Z graph_break [] 2025-12-04T09:41:43.0237993Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0238163Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0238255Z Autotune Choices Stats: 2025-12-04T09:41:43.0239086Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0239177Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0239264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0239367Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0239903Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0240377Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0240860Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0241342Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0241817Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0242298Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0242775Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0243335Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.0243804Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0244275Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0244613Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.0244707Z Autotune Choices Stats: 2025-12-04T09:41:43.0245554Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.0245646Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0245730Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0245833Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0246305Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0246798Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0247268Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0247844Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.0248317Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0248783Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0249252Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0249725Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0250220Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0250691Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0251025Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.0251117Z Autotune Choices Stats: 2025-12-04T09:41:43.0251950Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0252048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0252135Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0252242Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0252797Z 
triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0253271Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0253745Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0254226Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0254713Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0255191Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0255673Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0256147Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0256617Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0257168Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0257540Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.0257718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0257813Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0257943Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0258189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0259582Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0259673Z graph_break [] 2025-12-04T09:41:43.0259774Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0259944Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0260034Z Autotune Choices Stats: 2025-12-04T09:41:43.0260864Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0260955Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0261040Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0261147Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0261702Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0262179Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0262653Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0263133Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0263609Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0264097Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0264576Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0265060Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0265535Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0266004Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0266420Z 
SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.0266512Z Autotune Choices Stats: 2025-12-04T09:41:43.0267346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0267439Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0267522Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0267641Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0268155Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0268634Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0269104Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0269571Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0270040Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0270505Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0271052Z triton_mm_1214 0.0328 ms 84.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0271523Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0271998Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0272467Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0272796Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.0272974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0273066Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0273201Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0273446Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0274825Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0274913Z graph_break [] 2025-12-04T09:41:43.0275013Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0275263Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.0275352Z Autotune Choices Stats: 2025-12-04T09:41:43.0276203Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0276298Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0276381Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0276487Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0276972Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0277440Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0281572Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0282079Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0282555Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0283019Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0283493Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0284064Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0284623Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0285181Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0285570Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.0285667Z Autotune Choices Stats: 2025-12-04T09:41:43.0286671Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0286777Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0286869Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0286979Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0287600Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0288159Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0288715Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.0289380Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0289935Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0290487Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0291037Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0291598Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0292157Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0292716Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0293045Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.0293222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0293322Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0293454Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0293701Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 
1)] 2025-12-04T09:41:43.0295161Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0295249Z graph_break [] 2025-12-04T09:41:43.0295356Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0295529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0295622Z Autotune Choices Stats: 2025-12-04T09:41:43.0296473Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.0296572Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0296661Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0296768Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0297249Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0297784Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0298255Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0298732Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0299284Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0299764Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0300241Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0300881Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0301350Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0301824Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0302160Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.0302251Z Autotune Choices Stats: 2025-12-04T09:41:43.0303085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0303179Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0303266Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0303374Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0303861Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0304459Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0305022Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0305577Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0306130Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0306684Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0307242Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0307799Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0308353Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.0308910Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0309403Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.0309602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0309704Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0309845Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0310128Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0311846Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0311939Z graph_break [] 2025-12-04T09:41:43.0312046Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0312239Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0312334Z Autotune Choices Stats: 2025-12-04T09:41:43.0313331Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0313428Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0313516Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0313624Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.0314187Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0314751Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0315388Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0315953Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0316515Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0317121Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0317700Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0318263Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0318817Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0319370Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0319772Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.0319946Z Autotune Choices Stats: 2025-12-04T09:41:43.0320795Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0320890Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0320980Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0321085Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0321556Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0322026Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0322491Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0322971Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0323436Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0323902Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0324368Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0324843Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0325421Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0325897Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0326231Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.0326406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0326502Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0326637Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0326889Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0328325Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0328409Z graph_break [] 2025-12-04T09:41:43.0328511Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0328687Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0328778Z Autotune Choices Stats: 2025-12-04T09:41:43.0329626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0329796Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0329884Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0329997Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0330481Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0330957Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0331426Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0331898Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0332369Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0332835Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0333310Z 
triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0333790Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0334271Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0334821Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0335155Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.0335253Z Autotune Choices Stats: 2025-12-04T09:41:43.0336086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0336186Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0336280Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0336385Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0336864Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0337344Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0337987Z triton_mm_1388 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0339300Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0340336Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0341474Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0342508Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0343552Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0344595Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0345638Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0346572Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.0347202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0347575Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0347875Z stats [('calls_captured', 2), ('unique_graphs', 
1)] 2025-12-04T09:41:43.0348346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0349644Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0350780Z graph_break [] 2025-12-04T09:41:43.0351009Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0351380Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0351755Z Autotune Choices Stats: 2025-12-04T09:41:43.0352825Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0353844Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0354102Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0354364Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0355038Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0356086Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0357184Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0358224Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0359261Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0360375Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0361514Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0362572Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0363625Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0364676Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0365580Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.0366102Z Autotune Choices Stats: 2025-12-04T09:41:43.0367139Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0368151Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0368403Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.0368661Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0369334Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0370381Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0371419Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0372546Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0373581Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0374610Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0375644Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0376687Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0377738Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0378785Z triton_mm_1435 
0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0379688Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.0380210Z Autotune Choices Stats: 2025-12-04T09:41:43.0381189Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0382285Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0382540Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0382803Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0383494Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0384545Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0385590Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0386656Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0387768Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0388824Z triton_mm_1460 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0389885Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0390937Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0392004Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0393250Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0394171Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.0394777Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0395150Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0395450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0395932Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0397230Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0398405Z graph_break [] 2025-12-04T09:41:43.0398631Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0399007Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0399379Z Autotune Choices Stats: 2025-12-04T09:41:43.0400549Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0401567Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0401822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0402086Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0402757Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0403954Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0405022Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0406086Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0407146Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0408255Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0409295Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0410331Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0411366Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0412397Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0413306Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0413827Z Autotune Choices Stats: 2025-12-04T09:41:43.0414923Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0415944Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0416194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0416463Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0417137Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0418244Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0419317Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0420359Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0421395Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0422432Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0423464Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0424600Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0425653Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0426700Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0427617Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.0428181Z Autotune Choices Stats: 2025-12-04T09:41:43.0429166Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0430198Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0430451Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0430720Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0431401Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0432459Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0433503Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0434661Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0435726Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0436785Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0437858Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0438925Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0440069Z triton_mm_1539 
0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0441103Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0442003Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.0442608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0442980Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0443295Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0443771Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0445154Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0446278Z graph_break [] 2025-12-04T09:41:43.0446510Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0446885Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0447267Z Autotune Choices Stats: 2025-12-04T09:41:43.0448291Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0449316Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0449576Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0449834Z dtypes: torch.float16, 
torch.float16 2025-12-04T09:41:43.0450517Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0451580Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0452637Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0453689Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0454754Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0455906Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0456958Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0458052Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0459086Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0460123Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0461036Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0461560Z Autotune Choices Stats: 2025-12-04T09:41:43.0462539Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0463562Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0463822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0464082Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0464763Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0465927Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0466971Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0468005Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0469051Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0470088Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0471136Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0472179Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0473230Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0474275Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0475181Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.0475705Z Autotune Choices Stats: 2025-12-04T09:41:43.0476779Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0477860Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0478116Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0478383Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0479068Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0480190Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0481253Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0482297Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0483333Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0484371Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0485413Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0486549Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0487649Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0488701Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0489610Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.0490219Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:41:43.0490596Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0490908Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0491385Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0493122Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0494678Z graph_break [] 2025-12-04T09:41:43.0494912Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0495291Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0495660Z Autotune Choices Stats: 2025-12-04T09:41:43.0496644Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0497799Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0498059Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0498317Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0498993Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0500053Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0501267Z triton_mm_1663 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0502334Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0503393Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0504431Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0505469Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0506511Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0507683Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0508732Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0509648Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.0510180Z Autotune Choices Stats: 2025-12-04T09:41:43.0511174Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0512221Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0512478Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0512738Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0513424Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0514484Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0515527Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0516569Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0517656Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0518806Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0519915Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0520953Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.0522007Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0523062Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0523977Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.0524584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0524963Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0525264Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0525742Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0527526Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0529184Z graph_break [] 2025-12-04T09:41:43.0529420Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0529794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0530162Z Autotune Choices Stats: 2025-12-04T09:41:43.0531150Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0532183Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0532438Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0532704Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0533392Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0534480Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0535529Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0536566Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0537610Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0538659Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0539794Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0540856Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0541904Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0542947Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0543865Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.0544387Z Autotune Choices Stats: 2025-12-04T09:41:43.0545376Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0546395Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0546647Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0546911Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0547614Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0549031Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0550075Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0551121Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0552168Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0553207Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0554267Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0555310Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0556365Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0557451Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0558385Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.0558989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0559369Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0559734Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0560303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0562030Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), 
('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0563579Z graph_break [] 2025-12-04T09:41:43.0563811Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0564186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0564561Z Autotune Choices Stats: 2025-12-04T09:41:43.0565559Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0566600Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0566858Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0567121Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0567793Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0568862Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0570008Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0571073Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0572133Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.0573195Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0574241Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0575289Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0581179Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0582224Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0583129Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.0583646Z Autotune Choices Stats: 2025-12-04T09:41:43.0584635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0585680Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0585938Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0586315Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0586994Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.0588097Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0589125Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0590160Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0591199Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0592223Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0593253Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0594293Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0595424Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0596478Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0597433Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.0598080Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.0598509Z Traceback (most recent call last): 2025-12-04T09:41:43.0599105Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.0599875Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.0600672Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.0601313Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.0601754Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.0602103Z Searched string: 2025-12-04T09:41:43.0602369Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.0602600Z 2025-12-04T09:41:43.0602716Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.0602935Z 2025-12-04T09:41:43.0603062Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.0603416Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.0603642Z 2025-12-04T09:41:43.0603740Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.0604002Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.0604258Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0604526Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.0604700Z 2025-12-04T09:41:43.0604794Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.0605045Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.0605312Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0605575Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.0605745Z 2025-12-04T09:41:43.0605886Z 2025-12-04T09:41:43.0606053Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.0606315Z 2025-12-04T09:41:43.0606320Z 2025-12-04T09:41:43.0606440Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.0606773Z rm = pid_m * BLOCK_M + 
tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.0607096Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.0607435Z idx_m = rm[:, None] 2025-12-04T09:41:43.0607671Z idx_n = rn[None, :] 2025-12-04T09:41:43.0607913Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.0608096Z 2025-12-04T09:41:43.0608194Z # inductor generates a suffix 2025-12-04T09:41:43.0608468Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.0608851Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.0609248Z ''', device_str='cuda') 2025-12-04T09:41:43.0609396Z 2025-12-04T09:41:43.0609400Z 2025-12-04T09:41:43.0609500Z async_compile.wait(globals()) 2025-12-04T09:41:43.0609763Z del async_compile 2025-12-04T09:41:43.0609896Z 2025-12-04T09:41:43.0609981Z class Runner: 2025-12-04T09:41:43.0610199Z def __init__(self, partitions): 2025-12-04T09:41:43.0610491Z self.partitions = partitions 2025-12-04T09:41:43.0610682Z 2025-12-04T09:41:43.0610795Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.0611084Z new_callables = [] 2025-12-04T09:41:43.0611359Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.0611680Z new_callables.append(fn(c)) 2025-12-04T09:41:43.0611977Z self.partitions = new_callables 2025-12-04T09:41:43.0612178Z 2025-12-04T09:41:43.0612269Z def call(self, args): 2025-12-04T09:41:43.0612513Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.0612879Z args.clear() 2025-12-04T09:41:43.0613144Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.0613488Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.0613817Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.0614111Z torch.cuda.set_device(0) 2025-12-04T09:41:43.0614461Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.0614951Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.0615374Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.0615752Z 
triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.0616130Z del arg0_1 2025-12-04T09:41:43.0616430Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.0616953Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.0617415Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.0617873Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.0618277Z del arg1_1 2025-12-04T09:41:43.0618502Z del buf0 2025-12-04T09:41:43.0618718Z return (buf1, ) 2025-12-04T09:41:43.0618862Z 2025-12-04T09:41:43.0618965Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.0619225Z call = runner.call 2025-12-04T09:41:43.0619520Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.0619774Z 2025-12-04T09:41:43.0619778Z 2025-12-04T09:41:43.0619919Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.0620286Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.0620657Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.0621097Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.0621607Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.0622018Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.0622370Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.0622745Z 2025-12-04T09:41:43.0622749Z 2025-12-04T09:41:43.0622847Z if __name__ == "__main__": 2025-12-04T09:41:43.0623198Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.0623659Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.0624001Z From CHECK: .to( 2025-12-04T09:41:43.0624135Z 2025-12-04T09:41:43.0624139Z 2025-12-04T09:41:43.0624314Z To execute this test, run the following from the base repo dir: 
2025-12-04T09:41:43.0625142Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.0625797Z 2025-12-04T09:41:43.0626018Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.0626518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0626889Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0627199Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0627725Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0629458Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0631012Z graph_break [] 2025-12-04T09:41:43.0631238Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0631692Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0632058Z Autotune Choices Stats: 2025-12-04T09:41:43.0633055Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0634072Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0634327Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0634589Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.0635265Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0636310Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0637395Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0638416Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0639425Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0640509Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0641535Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0642632Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0643657Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0644673Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0645573Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.0646099Z Autotune Choices Stats: 2025-12-04T09:41:43.0647097Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0648267Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0648521Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0648782Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0649453Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0650489Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0651504Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0652603Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0653621Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0654628Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0655637Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0656657Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0657692Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0658728Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0659632Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.0660226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0660591Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0660888Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0661357Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0662647Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0663853Z graph_break [] 2025-12-04T09:41:43.0664079Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0664444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0664808Z Autotune Choices Stats: 
2025-12-04T09:41:43.0665798Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0666805Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0667051Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0667311Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0668027Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0669061Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0670094Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0671136Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0672182Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0673323Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0674366Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.0675385Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0676416Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0677447Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0678388Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.0678899Z Autotune Choices Stats: 2025-12-04T09:41:43.0679956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0680965Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0681214Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0681464Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0682128Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0683161Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0684285Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0685311Z triton_mm_823 
0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0686327Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0687346Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0688377Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0689400Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0690431Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0691465Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0692364Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.0693011Z Autotune Choices Stats: 2025-12-04T09:41:43.0693994Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.0695008Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.0695256Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0695516Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0696186Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0697278Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0698296Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0699328Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0700519Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0701564Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0702610Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0703658Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0704845Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:43.0705878Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0706767Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.0707373Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0707790Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0708088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0708574Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0709872Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0710999Z graph_break [] 2025-12-04T09:41:43.0711222Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0711597Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0711965Z Autotune Choices Stats: 2025-12-04T09:41:43.0712951Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0714117Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0714374Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0714635Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0715311Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0716344Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0717373Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0718392Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0719426Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0720547Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0721580Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0722616Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0723648Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0724673Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0725640Z SingleProcess AUTOTUNE 
benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.0726154Z Autotune Choices Stats: 2025-12-04T09:41:43.0727181Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.0728191Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0728441Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0728699Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0729375Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0730414Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0731437Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0731910Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0732370Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0732913Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0733374Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0733848Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0734314Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0734778Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0735116Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.0735206Z Autotune Choices Stats: 2025-12-04T09:41:43.0736046Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0736144Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0736228Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0736343Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0736813Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0737313Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0737902Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0738364Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0738829Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0739289Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0739754Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0740218Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0740682Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0741148Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0741473Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.0741653Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0741824Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0741959Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0742206Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0743588Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0748622Z graph_break [] 2025-12-04T09:41:43.0748733Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0748921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0749016Z Autotune Choices Stats: 2025-12-04T09:41:43.0749873Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0749977Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0750065Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0750172Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0750656Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0751117Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0751603Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0752174Z 
triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0752641Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0753114Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0753585Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0754053Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0754532Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0754995Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0755326Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.0755418Z Autotune Choices Stats: 2025-12-04T09:41:43.0756246Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0756414Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.0756500Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0756614Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0757094Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0757615Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0758095Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0758639Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0759108Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0759627Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0760099Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0760571Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0761046Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:43.0761605Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0761933Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.0762114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0762209Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0762340Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0762589Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0763534Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0763622Z graph_break [] 2025-12-04T09:41:43.0763728Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0763905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0764004Z Autotune Choices Stats: 2025-12-04T09:41:43.0764833Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.0764931Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0765015Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0765119Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0765606Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0766127Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0766600Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0767069Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0767644Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0768113Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0768584Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0769062Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0769532Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0770013Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0770345Z 
SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.0770438Z Autotune Choices Stats: 2025-12-04T09:41:43.0771350Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0771445Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0771534Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0771636Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0772107Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0772589Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0773064Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0773540Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0774010Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0774475Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0774953Z triton_mm_1042 0.0328 ms 84.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0775472Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0775948Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0776420Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0776794Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.0776884Z Autotune Choices Stats: 2025-12-04T09:41:43.0777721Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0777826Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0777913Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0778030Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0778511Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0778991Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0779463Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0779931Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0780487Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0780955Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0781426Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0781901Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0782373Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0782851Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0783179Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.0783356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0783450Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0783581Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.0783828Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0784816Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0784905Z graph_break [] 2025-12-04T09:41:43.0785011Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0785183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0785276Z Autotune Choices Stats: 2025-12-04T09:41:43.0786105Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0786251Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0786338Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0786447Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0786929Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0787431Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0787929Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0788424Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0788905Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0789493Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0789965Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0790444Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0790917Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0791388Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0791726Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.0791820Z Autotune Choices Stats: 2025-12-04T09:41:43.0792654Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.0792745Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0792829Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.0792937Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0793413Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0793951Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0794431Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0794905Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0795375Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0795886Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0796359Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0796832Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0797321Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0797827Z triton_mm_1131 
0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0798166Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.0798264Z Autotune Choices Stats: 2025-12-04T09:41:43.0799168Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0799265Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0799352Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0799464Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0800038Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0800667Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0801148Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0801633Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0802113Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0802589Z triton_mm_1160 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0803075Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0803625Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0804102Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0804573Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0804900Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.0805144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0805244Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0805378Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0805634Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0807013Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0807103Z 
graph_break [] 2025-12-04T09:41:43.0807210Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0807414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0807523Z Autotune Choices Stats: 2025-12-04T09:41:43.0808376Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0808479Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0808674Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0808783Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0809262Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0809733Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0810212Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0810699Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0811181Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0811669Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0812145Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0812634Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.0813155Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0813632Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0813969Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.0814059Z Autotune Choices Stats: 2025-12-04T09:41:43.0814898Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0815038Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0815122Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0815233Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0815729Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0816204Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.0816675Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0817207Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0817676Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0818216Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0818689Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0819160Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0819639Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0820111Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0820447Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.0820621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0820713Z frames [('total', 1), 
('ok', 1)] 2025-12-04T09:41:43.0820849Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0821094Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0822477Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0822609Z graph_break [] 2025-12-04T09:41:43.0822719Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0822895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0822984Z Autotune Choices Stats: 2025-12-04T09:41:43.0823847Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0823996Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0824084Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0824193Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0824682Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0825162Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0825634Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0826100Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0826580Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0827048Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0827703Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0828184Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0828654Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0829138Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0829471Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.0829567Z Autotune Choices Stats: 2025-12-04T09:41:43.0830407Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0830501Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0830587Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0830691Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0831187Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0831660Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0832185Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0832664Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0833130Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0833641Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0834109Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0834592Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.0835066Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0835546Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0835880Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.0836053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0836156Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0836286Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0836612Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0837993Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0838078Z graph_break [] 2025-12-04T09:41:43.0838185Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0838357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0838453Z Autotune Choices Stats: 2025-12-04T09:41:43.0839299Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, 
"best_triton_pos": 0} 2025-12-04T09:41:43.0839389Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0839532Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0839637Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0840113Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0840597Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0841067Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0841602Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0842075Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0842558Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0843077Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0843546Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0844018Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0844480Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0844813Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.0844906Z Autotune Choices Stats: 2025-12-04T09:41:43.0845740Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0845840Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0845926Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0846109Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0846589Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0851220Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0851729Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0852210Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0852688Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0853156Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0853626Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0854104Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0854580Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0855128Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0855463Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.0855643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0855738Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0855873Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0856170Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0857600Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0857691Z graph_break [] 2025-12-04T09:41:43.0857798Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0857978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0858072Z Autotune Choices Stats: 2025-12-04T09:41:43.0858920Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0859021Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0859111Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0859223Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0859780Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0860257Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0860736Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0861220Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0861707Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0862192Z 
triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0862677Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0863151Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0863627Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0864099Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0864480Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.0864578Z Autotune Choices Stats: 2025-12-04T09:41:43.0865400Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0865494Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0865654Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0865762Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0866241Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0866711Z triton_mm_1342 0.0276 ms 99.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0867181Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0867656Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0868122Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0868594Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0869135Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0869611Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0870084Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0870555Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0870889Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.0871069Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0871168Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0871307Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0871558Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0872931Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0873017Z graph_break [] 2025-12-04T09:41:43.0873125Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0873343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0873435Z Autotune Choices Stats: 2025-12-04T09:41:43.0874291Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0874388Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0874479Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0874585Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0875069Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0875596Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0876069Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0876547Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0877056Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0877528Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0878006Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0878608Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0879083Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0879637Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0879975Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.0880071Z Autotune Choices Stats: 2025-12-04T09:41:43.0880909Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0881013Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0881102Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0881213Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0881692Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0882167Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0882646Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0883159Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0883632Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0884104Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0884575Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0885096Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0885572Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0886050Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0886382Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.0886559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0886658Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0886790Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0887045Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0888115Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0888204Z graph_break [] 2025-12-04T09:41:43.0888311Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0888484Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0888582Z Autotune Choices Stats: 2025-12-04T09:41:43.0889411Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 
2025-12-04T09:41:43.0889514Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0889602Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0889711Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0890194Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0890666Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0891140Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0891607Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0892083Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0892607Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0893084Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0893568Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0894096Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0894570Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0894902Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.0894998Z Autotune Choices Stats: 2025-12-04T09:41:43.0895834Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0895928Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0896020Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0896126Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0896603Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0897079Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0897623Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0898106Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0898574Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0899044Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0899515Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0899991Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0900635Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0901112Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0901447Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.0901646Z Autotune Choices Stats: 2025-12-04T09:41:43.0902483Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0902582Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0902669Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0902781Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0903267Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.0903807Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0904288Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0904771Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0905251Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0905726Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0906207Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0906694Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0907284Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0907819Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0908150Z SingleProcess AUTOTUNE benchmarking takes 
0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.0908333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0908430Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0908569Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0908816Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0909765Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0909852Z graph_break [] 2025-12-04T09:41:43.0909957Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0910134Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0910232Z Autotune Choices Stats: 2025-12-04T09:41:43.0911068Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0911205Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0911295Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0911403Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0911895Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0912377Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.0912855Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0913395Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0913873Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0914350Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0914822Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0915295Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0915764Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0916309Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0916644Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0916738Z Autotune Choices Stats: 2025-12-04T09:41:43.0917619Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0917716Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0917802Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0917909Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0918390Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0918872Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0919342Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0919925Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0920400Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0920916Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0921391Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0921864Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0922340Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0922856Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0923194Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.0923289Z Autotune Choices Stats: 2025-12-04T09:41:43.0924121Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0924217Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0924304Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0924417Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0924900Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0925380Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0925932Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0926414Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0926895Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0927378Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0927862Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0928344Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0928813Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0929284Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0929614Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.0929833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0929928Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0930061Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0930317Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0931263Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0931346Z graph_break [] 2025-12-04T09:41:43.0931459Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0931676Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0931771Z Autotune Choices Stats: 2025-12-04T09:41:43.0932608Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0932711Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0932801Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0932907Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0933392Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0933866Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0934352Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0934837Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0935424Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.0935903Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0936370Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0936853Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0937374Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0937839Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0938175Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.0938267Z Autotune Choices Stats: 2025-12-04T09:41:43.0939092Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.0939232Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0939319Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0939427Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0939906Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.0940379Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0940847Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0941355Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0941838Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0942306Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0942776Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0943247Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0943726Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0944200Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0944602Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.0944700Z Autotune Choices Stats: 2025-12-04T09:41:43.0945556Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0945660Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0945748Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0945859Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.0946342Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0946831Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0947340Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0947806Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0948275Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0948784Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0949262Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0949740Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0950211Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0950730Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0951059Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.0951233Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0951333Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0951464Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0951713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0953088Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0953175Z graph_break [] 2025-12-04T09:41:43.0953282Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0953456Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0953554Z Autotune Choices Stats: 2025-12-04T09:41:43.0954455Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0954552Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0954642Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0954747Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0955226Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0955707Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0956194Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0956669Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0957143Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0957616Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0958130Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0958603Z 
triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0959074Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0959658Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0960038Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.0960132Z Autotune Choices Stats: 2025-12-04T09:41:43.0960993Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0961088Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0961174Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0961281Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0961762Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0962236Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0962703Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0963247Z triton_mm_1689 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0963718Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0964185Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0964656Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0965129Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0965611Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0966082Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0966411Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.0966592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0966686Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0966821Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0967069Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0968542Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0968629Z graph_break [] 2025-12-04T09:41:43.0968735Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0968912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0969048Z Autotune Choices Stats: 2025-12-04T09:41:43.0969890Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.0969990Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0970077Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0970193Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0970678Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0971152Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0971626Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0972097Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0972672Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0973148Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0973621Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0974107Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0974586Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0975067Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0975399Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.0975493Z Autotune Choices Stats: 2025-12-04T09:41:43.0976316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.0976412Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0976501Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0976647Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.0977174Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0977648Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0978117Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0978597Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0979105Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0979582Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0980048Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0980525Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0981000Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0981473Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0981811Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.0982060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0982159Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0982292Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0982540Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0983920Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0984009Z graph_break [] 2025-12-04T09:41:43.0984115Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0984295Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.0984386Z Autotune Choices Stats: 2025-12-04T09:41:43.0985227Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0985319Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0985411Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0985517Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.0986003Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0986530Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0987010Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0987491Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0988012Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0988482Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0988963Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0989430Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0989900Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0990369Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.0990706Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.0990803Z Autotune Choices Stats: 2025-12-04T09:41:43.0991734Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.0991834Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.0991921Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.0992030Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.0992513Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0992991Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0993468Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.0993948Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.0994426Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.0994895Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.0995362Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.0995889Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0996364Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.0996842Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.0997171Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.0997392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.0997506Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.0997662Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.0997931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.0999318Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.0999404Z graph_break [] 2025-12-04T09:41:43.0999626Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.0999803Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.0999897Z Autotune Choices Stats: 2025-12-04T09:41:43.1000856Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1001082Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1001176Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1001283Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1001772Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1002253Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1002738Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1003221Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1003702Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1004190Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1004661Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1005140Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1005672Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1006144Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1006478Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.1006572Z Autotune Choices Stats: 2025-12-04T09:41:43.1007442Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1007640Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1007730Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1007837Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1008319Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1008800Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1009276Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1009759Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1010227Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1010770Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1011242Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1011714Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1012193Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1012669Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1013009Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.1013232Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.1013335Z Traceback (most recent call last): 2025-12-04T09:41:43.1013759Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in 
test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1013949Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.1014293Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.1014479Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.1014685Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.1014780Z Searched string: 2025-12-04T09:41:43.1014919Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.1014928Z 2025-12-04T09:41:43.1015052Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.1015057Z 2025-12-04T09:41:43.1015192Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1015319Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1015323Z 2025-12-04T09:41:43.1015419Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.1015515Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.1015652Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1015748Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.1015752Z 2025-12-04T09:41:43.1015846Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.1015936Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.1016034Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1016121Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.1016126Z 2025-12-04T09:41:43.1016129Z 2025-12-04T09:41:43.1016296Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.1016304Z 2025-12-04T09:41:43.1016308Z 2025-12-04T09:41:43.1016430Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.1016545Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.1016662Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.1016753Z idx_m = rm[:, None] 2025-12-04T09:41:43.1016836Z idx_n = rn[None, :] 2025-12-04T09:41:43.1016941Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.1016948Z 2025-12-04T09:41:43.1017048Z # 
inductor generates a suffix 2025-12-04T09:41:43.1017141Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1017352Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.1017441Z ''', device_str='cuda') 2025-12-04T09:41:43.1017445Z 2025-12-04T09:41:43.1017449Z 2025-12-04T09:41:43.1017554Z async_compile.wait(globals()) 2025-12-04T09:41:43.1017638Z del async_compile 2025-12-04T09:41:43.1017643Z 2025-12-04T09:41:43.1017725Z class Runner: 2025-12-04T09:41:43.1017907Z def __init__(self, partitions): 2025-12-04T09:41:43.1018015Z self.partitions = partitions 2025-12-04T09:41:43.1018020Z 2025-12-04T09:41:43.1018136Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.1018230Z new_callables = [] 2025-12-04T09:41:43.1018348Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.1018457Z new_callables.append(fn(c)) 2025-12-04T09:41:43.1018561Z self.partitions = new_callables 2025-12-04T09:41:43.1018567Z 2025-12-04T09:41:43.1018658Z def call(self, args): 2025-12-04T09:41:43.1018750Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.1018833Z args.clear() 2025-12-04T09:41:43.1018965Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1019094Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1019203Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.1019303Z torch.cuda.set_device(0) 2025-12-04T09:41:43.1019476Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1019696Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.1019801Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1019996Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.1020084Z del arg0_1 2025-12-04T09:41:43.1020252Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1020510Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 
2025-12-04T09:41:43.1020613Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1020877Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.1020962Z del arg1_1 2025-12-04T09:41:43.1021046Z del buf0 2025-12-04T09:41:43.1021132Z return (buf1, ) 2025-12-04T09:41:43.1021136Z 2025-12-04T09:41:43.1021242Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.1021331Z call = runner.call 2025-12-04T09:41:43.1021491Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.1021495Z 2025-12-04T09:41:43.1021499Z 2025-12-04T09:41:43.1021642Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.1021776Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.1021925Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.1022179Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.1022380Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.1022488Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.1022655Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.1022660Z 2025-12-04T09:41:43.1022664Z 2025-12-04T09:41:43.1022755Z if __name__ == "__main__": 2025-12-04T09:41:43.1022974Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.1023134Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.1023221Z From CHECK: .to( 2025-12-04T09:41:43.1023225Z 2025-12-04T09:41:43.1023229Z 2025-12-04T09:41:43.1023410Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.1023963Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1023971Z 2025-12-04T09:41:43.1024196Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 
2025-12-04T09:41:43.1024374Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1024474Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1024609Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1024986Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1026366Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1026454Z graph_break [] 2025-12-04T09:41:43.1026561Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1026741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1026839Z Autotune Choices Stats: 2025-12-04T09:41:43.1027735Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1027835Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1027923Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1028037Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1028517Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1028985Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1029492Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1029957Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1030420Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1030891Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1031393Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1031855Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1032326Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1032794Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1033126Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.1033231Z Autotune Choices Stats: 2025-12-04T09:41:43.1034057Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1034159Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1034248Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1034432Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1034915Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1035374Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1035835Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1036295Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1036760Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1037226Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1037737Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1038212Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1038676Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1039223Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1039609Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.1039784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1039886Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1040019Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1040319Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1041268Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1041358Z graph_break [] 2025-12-04T09:41:43.1041468Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1041646Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1041741Z Autotune Choices Stats: 2025-12-04T09:41:43.1042575Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.1042671Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1042764Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1042870Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1043342Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1043906Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1044378Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1044858Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1045341Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1045830Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1046303Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1046775Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1047250Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1047715Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1048091Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.1048185Z Autotune Choices Stats: 2025-12-04T09:41:43.1049017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1049116Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1049204Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1049313Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1049784Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1050296Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1050776Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1051241Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1051708Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:43.1052170Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1052648Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1053203Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1053674Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1054150Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1054484Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.1054582Z Autotune Choices Stats: 2025-12-04T09:41:43.1055411Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.1055519Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1055609Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1055723Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1056209Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.1056672Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1057141Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1057703Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1058179Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1058657Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1059123Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1059657Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1060126Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1060602Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1060933Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling 
for 15 choices 2025-12-04T09:41:43.1061107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1061209Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1061342Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1061590Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1062622Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1062707Z graph_break [] 2025-12-04T09:41:43.1062816Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1062991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1063084Z Autotune Choices Stats: 2025-12-04T09:41:43.1063916Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1064014Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1064107Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1064218Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1064700Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1065168Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1065629Z triton_mm_882 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1066098Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1066570Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1067088Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1067574Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1068086Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1068591Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1069055Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1069391Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.1069488Z Autotune Choices Stats: 2025-12-04T09:41:43.1070314Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.1070413Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1070500Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1070610Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1071084Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1071563Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1072128Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1072595Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1073069Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1073535Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1074005Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1074478Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.1074948Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1075433Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1075765Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.1075904Z Autotune Choices Stats: 2025-12-04T09:41:43.1076739Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1076836Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1076929Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1077041Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1077526Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1078038Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1078520Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1078992Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1079458Z 
triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1079990Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1080460Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1080929Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1081472Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1081945Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1082283Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.1082462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1082562Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1082694Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1082940Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1084323Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1084408Z graph_break [] 2025-12-04T09:41:43.1084518Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1084693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1084788Z Autotune Choices Stats: 2025-12-04T09:41:43.1085632Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1085770Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1085865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1085975Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1086459Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1086951Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1087478Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1087952Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1088424Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1088893Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1089369Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1089841Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1090315Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1090851Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1091197Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.1091293Z Autotune Choices Stats: 2025-12-04T09:41:43.1092144Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1092251Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1092342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1092451Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1092933Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1093405Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1093885Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1094353Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1094822Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1095334Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1095807Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1096284Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1096798Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1097279Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1097665Z SingleProcess AUTOTUNE benchmarking 
takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.1097850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1097946Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1098077Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1098332Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1099279Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1099369Z graph_break [] 2025-12-04T09:41:43.1099474Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1099649Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1099747Z Autotune Choices Stats: 2025-12-04T09:41:43.1100846Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.1100950Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1101039Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1101147Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1101646Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1102121Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.1102594Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1103071Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1103537Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1104012Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1104539Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1105020Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1105493Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1105968Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1106356Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.1106453Z Autotune Choices Stats: 2025-12-04T09:41:43.1107296Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1107391Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1107482Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1107599Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1108078Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1108558Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1109033Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1109612Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1110083Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1110550Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1111024Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1111499Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1111980Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1116014Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1116368Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.1116464Z Autotune Choices Stats: 2025-12-04T09:41:43.1117337Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1117522Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1117607Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1117718Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1118209Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1118686Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1119161Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1119753Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1120236Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1120701Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1121164Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1121642Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1122111Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1122664Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1122996Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.1123176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1123270Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1123401Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1123653Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1124590Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1124676Z graph_break [] 2025-12-04T09:41:43.1124781Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1124957Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1125051Z Autotune Choices Stats: 2025-12-04T09:41:43.1125881Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1125975Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1126065Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1126169Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1126647Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1127169Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1127688Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1128168Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1128681Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.1129162Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1129633Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1130109Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1130572Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1131044Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1131379Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.1131469Z Autotune Choices Stats: 2025-12-04T09:41:43.1132375Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1132468Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1132552Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1132657Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1133130Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.1133613Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1134090Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1134558Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1135031Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1135497Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1135965Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1136487Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1136961Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1137435Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1137799Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.1137893Z Autotune Choices Stats: 2025-12-04T09:41:43.1138724Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1138827Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1138911Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1139019Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1139497Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1139965Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1140443Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1140996Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1141473Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1141951Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1142432Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1142901Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1143374Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1143841Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1144167Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.1144340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1144436Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1144566Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1144814Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1146236Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1146317Z graph_break [] 2025-12-04T09:41:43.1146426Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1146597Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1146759Z Autotune Choices Stats: 
2025-12-04T09:41:43.1147612Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1147713Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1147819Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1147922Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1148405Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1148875Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1149345Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1149826Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1150379Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1150859Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1151335Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.1151825Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1152302Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1152775Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1153106Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.1153196Z Autotune Choices Stats: 2025-12-04T09:41:43.1154036Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1154130Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1154214Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1154361Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1154845Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1155323Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1155788Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1156254Z 
triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1156762Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1157278Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1157747Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1158216Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1158688Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1159161Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1159573Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.1159830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1159925Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1160058Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1160303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1161679Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1161770Z graph_break [] 2025-12-04T09:41:43.1161873Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1162047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1162139Z Autotune Choices Stats: 2025-12-04T09:41:43.1162979Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1163073Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1163160Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1163265Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1163747Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1164263Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1164737Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1165202Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1165667Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1166173Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1166652Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1167116Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1167586Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1168058Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1168389Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.1168484Z Autotune Choices Stats: 2025-12-04T09:41:43.1169394Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1169487Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1169575Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1169678Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1170154Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1170627Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1171098Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1171572Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1172038Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1172504Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1172970Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1173485Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1173959Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1174432Z triton_mm_1260 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1174765Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.1175028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1175124Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1175256Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1175499Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1176874Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1176956Z graph_break [] 2025-12-04T09:41:43.1177084Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1177283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1177372Z Autotune Choices Stats: 2025-12-04T09:41:43.1178201Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.1178371Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1178463Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1178566Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1179041Z triton_mm_1266 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1179515Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1179989Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1180468Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1180940Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1181416Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1181901Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1182368Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1182903Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1183370Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1183704Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.1183793Z Autotune Choices Stats: 2025-12-04T09:41:43.1184629Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1184766Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1184850Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1184955Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1185433Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1185906Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1186379Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1186851Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1187369Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1187909Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.1188379Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1188850Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1189320Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1189796Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1190127Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.1190301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1190393Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1190522Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1190765Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1192139Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1192265Z graph_break [] 2025-12-04T09:41:43.1192371Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1192542Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1192634Z Autotune Choices Stats: 2025-12-04T09:41:43.1193463Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1193597Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1193681Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1193784Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1194263Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1194739Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1195210Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1195685Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1196165Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1196649Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1197207Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1197691Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1198161Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1198631Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1198961Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.1199050Z Autotune Choices Stats: 2025-12-04T09:41:43.1199957Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1200048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1200134Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1200404Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1200885Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1201355Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1201899Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1202370Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1202835Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1203357Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1203825Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1204297Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1204770Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1205241Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1205577Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.1205749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1205843Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1205977Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1206326Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1207753Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1207838Z graph_break [] 2025-12-04T09:41:43.1207940Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1208114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1208207Z Autotune Choices Stats: 2025-12-04T09:41:43.1209072Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1209164Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1209248Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1209356Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1209835Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1210307Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1210775Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.1211287Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1211754Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1212217Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1212735Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1213206Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1213681Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1214152Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1214480Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.1214576Z Autotune Choices Stats: 2025-12-04T09:41:43.1215408Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.1215504Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1215588Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1215791Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1216276Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1216747Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1217219Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1217737Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1218208Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1218674Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1219138Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1219616Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1220084Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1220600Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1220929Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.1221101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1221195Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1221325Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1221609Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1222550Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1222636Z graph_break [] 2025-12-04T09:41:43.1222743Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1222914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1223003Z Autotune Choices Stats: 2025-12-04T09:41:43.1223836Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1223929Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1224015Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1224118Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1224595Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1225146Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1225615Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1226082Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1226556Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1227034Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1227510Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1227986Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1228471Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1228940Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.1229315Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.1229405Z Autotune Choices Stats: 2025-12-04T09:41:43.1230236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1230333Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1230417Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1230522Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1230994Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1231503Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1231976Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1232441Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1232906Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1233372Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.1233838Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1234393Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1234866Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1235338Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1235668Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.1235760Z Autotune Choices Stats: 2025-12-04T09:41:43.1236594Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1236688Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1236772Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1236880Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1237362Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1237880Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1238352Z 
triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1238877Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1239349Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1239887Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1240406Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1240884Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1241373Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1241851Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1242182Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.1242354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1242451Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1242580Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1242825Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1243848Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1243931Z graph_break [] 2025-12-04T09:41:43.1244036Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1244207Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1244296Z Autotune Choices Stats: 2025-12-04T09:41:43.1245127Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1245221Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1245309Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1245412Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1245889Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1246369Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1246843Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1247328Z triton_mm_1495 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1247883Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1248354Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1248821Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1249284Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1249789Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1250256Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1250591Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.1250681Z Autotune Choices Stats: 2025-12-04T09:41:43.1251515Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1251611Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.1251694Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1251799Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1252270Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1252843Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1253314Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1253782Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1254254Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1254721Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1255203Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1255675Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1256145Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.1256624Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1256959Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.1257099Z Autotune Choices Stats: 2025-12-04T09:41:43.1257930Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1258021Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1258108Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1258217Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1258690Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1259210Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1259682Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1260164Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1260636Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.1261115Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1261600Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1262155Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1262621Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1263087Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1263421Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.1263592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1263691Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1263821Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1264063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1265009Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1265090Z graph_break [] 2025-12-04T09:41:43.1265199Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.1265370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1265462Z Autotune Choices Stats: 2025-12-04T09:41:43.1266303Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1266439Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1266523Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1266636Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1267166Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1267643Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1268155Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1268636Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1269124Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1269596Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1270064Z triton_mm_1567 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1270536Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1271007Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1271547Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1271882Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.1271972Z Autotune Choices Stats: 2025-12-04T09:41:43.1272813Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1272911Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1272995Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1273102Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1273584Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1274055Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1274525Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1274991Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1275464Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1275974Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1276438Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1276913Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1277428Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1277950Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1278280Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.1278379Z Autotune Choices Stats: 2025-12-04T09:41:43.1279209Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1279300Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1279391Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1279576Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1280058Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1280532Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1281074Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1281551Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1282016Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1282487Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1282959Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1283438Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.1283917Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1284395Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1284726Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.1284937Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1285033Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1285162Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1285408Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1286791Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1286937Z graph_break [] 2025-12-04T09:41:43.1287046Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1287221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1287310Z Autotune Choices Stats: 2025-12-04T09:41:43.1288195Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.1288286Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1288371Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1288480Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1288957Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1289435Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1289914Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1290463Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1290940Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1291406Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1291883Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1292356Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1292839Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1293311Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1293642Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.1293734Z Autotune Choices Stats: 2025-12-04T09:41:43.1294577Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1294715Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1294803Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1294906Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1295385Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1295854Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1296365Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1296852Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1297351Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1297813Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1298276Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1298755Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1299231Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1299784Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1300116Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.1300447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1300543Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1300677Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1300927Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1302300Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1302386Z graph_break [] 2025-12-04T09:41:43.1302491Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1302661Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1302753Z Autotune Choices Stats: 2025-12-04T09:41:43.1303596Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1303762Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1303846Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1303951Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1304435Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1304903Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1305371Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1305900Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1306367Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1306869Z 
triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1307363Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1307836Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1308304Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1308781Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1309211Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.1309304Z Autotune Choices Stats: 2025-12-04T09:41:43.1310151Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1310250Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1310339Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1310444Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1310915Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1311394Z triton_mm_1728 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1311857Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1312330Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1312800Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1313310Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1313780Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1314252Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1314728Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1315247Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1315583Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.1315756Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1315853Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1315989Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1316234Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1317689Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1317776Z graph_break [] 2025-12-04T09:41:43.1317878Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1318052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1318141Z Autotune Choices Stats: 2025-12-04T09:41:43.1319057Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1319155Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1319240Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1319346Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1319893Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1320372Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1320856Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1321331Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1321815Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1322284Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1322916Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1323383Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1323849Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1324317Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1324697Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.1324793Z Autotune Choices Stats: 2025-12-04T09:41:43.1325638Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1325733Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1325819Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1325919Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1326402Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1326872Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1327336Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1327897Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1328362Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1328829Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1329299Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1329780Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1330252Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1330734Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1331062Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.1331235Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1331331Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1331460Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1331748Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1333131Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1333213Z graph_break [] 2025-12-04T09:41:43.1333318Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1333494Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1333627Z Autotune Choices Stats: 2025-12-04T09:41:43.1334483Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1334576Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1334666Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1334772Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1335253Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1335734Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1336208Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1336684Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1337242Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1337776Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1338242Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1338713Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.1339182Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1339651Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1339984Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.1340073Z Autotune Choices Stats: 2025-12-04T09:41:43.1340910Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1341006Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1341132Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1341239Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1341719Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1342191Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1342665Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1343135Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:43.1343653Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1344122Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1344590Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1345061Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1345532Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1346004Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1346334Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.1346586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1346681Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1346814Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1347063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1348058Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), 
('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1348150Z graph_break [] 2025-12-04T09:41:43.1348252Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1348425Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1348518Z Autotune Choices Stats: 2025-12-04T09:41:43.1349358Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.1349452Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1349537Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1349639Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1350131Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1350599Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1351120Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1351594Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1352061Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1352591Z triton_mm_1839 
0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1353061Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1353536Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1354002Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1354475Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1354805Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.1354895Z Autotune Choices Stats: 2025-12-04T09:41:43.1355824Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1355922Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1356010Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1356112Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1356585Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1357065Z triton_mm_1856 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1357532Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1358007Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1358471Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1358936Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1359407Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1359927Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1360451Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1360922Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1361252Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.1361343Z Autotune 
Choices Stats: 2025-12-04T09:41:43.1362224Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1362325Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1362409Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1362523Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1362999Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1363473Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1363949Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1364422Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1364982Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1365456Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1365936Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:43.1366423Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1366893Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1367397Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1367746Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.1367924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1368018Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1368147Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1368399Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1369780Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1369908Z graph_break [] 2025-12-04T09:41:43.1370012Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1370184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1370279Z Autotune Choices Stats: 2025-12-04T09:41:43.1371121Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1371256Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1371343Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1371449Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1371933Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1372408Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1372884Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1373357Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1373836Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1374384Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1374855Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1375328Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1375801Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1376274Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1376609Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.1376709Z Autotune Choices Stats: 2025-12-04T09:41:43.1377595Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1377687Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1377778Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1377881Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1378358Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1378875Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1379349Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1379818Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1380281Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1380787Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1381264Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1381739Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1382212Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1382679Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1383012Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.1383235Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.1383339Z Traceback (most recent call last): 2025-12-04T09:41:43.1384221Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1384406Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.1384758Z File 
"/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.1384938Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.1385098Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.1385188Z Searched string: 2025-12-04T09:41:43.1385322Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.1385327Z 2025-12-04T09:41:43.1385442Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.1385449Z 2025-12-04T09:41:43.1385578Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1385702Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1385707Z 2025-12-04T09:41:43.1385803Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.1385894Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.1385985Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1386082Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.1386086Z 2025-12-04T09:41:43.1386173Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.1386264Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.1386356Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1386441Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.1386448Z 2025-12-04T09:41:43.1386452Z 2025-12-04T09:41:43.1386617Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.1386621Z 2025-12-04T09:41:43.1386625Z 2025-12-04T09:41:43.1386743Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.1386903Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.1387042Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.1387138Z idx_m = rm[:, None] 2025-12-04T09:41:43.1387239Z idx_n = rn[None, :] 2025-12-04T09:41:43.1392217Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.1392226Z 2025-12-04T09:41:43.1392347Z # inductor generates a suffix 2025-12-04T09:41:43.1392452Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1392675Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, 
[BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.1392765Z ''', device_str='cuda') 2025-12-04T09:41:43.1392770Z 2025-12-04T09:41:43.1392773Z 2025-12-04T09:41:43.1392948Z async_compile.wait(globals()) 2025-12-04T09:41:43.1393033Z del async_compile 2025-12-04T09:41:43.1393038Z 2025-12-04T09:41:43.1393124Z class Runner: 2025-12-04T09:41:43.1393229Z def __init__(self, partitions): 2025-12-04T09:41:43.1393334Z self.partitions = partitions 2025-12-04T09:41:43.1393341Z 2025-12-04T09:41:43.1393457Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.1393549Z new_callables = [] 2025-12-04T09:41:43.1393666Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.1393782Z new_callables.append(fn(c)) 2025-12-04T09:41:43.1393888Z self.partitions = new_callables 2025-12-04T09:41:43.1393892Z 2025-12-04T09:41:43.1393983Z def call(self, args): 2025-12-04T09:41:43.1394077Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.1394161Z args.clear() 2025-12-04T09:41:43.1394294Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1394420Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1394529Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.1394628Z torch.cuda.set_device(0) 2025-12-04T09:41:43.1394796Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1395022Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.1395124Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1395314Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.1395494Z del arg0_1 2025-12-04T09:41:43.1395663Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1395921Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.1396021Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1396245Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 
2025-12-04T09:41:43.1396331Z del arg1_1 2025-12-04T09:41:43.1396413Z del buf0 2025-12-04T09:41:43.1396498Z return (buf1, ) 2025-12-04T09:41:43.1396502Z 2025-12-04T09:41:43.1396605Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.1396695Z call = runner.call 2025-12-04T09:41:43.1396862Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.1396867Z 2025-12-04T09:41:43.1396871Z 2025-12-04T09:41:43.1397014Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.1397150Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.1397325Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.1397555Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.1397760Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.1397863Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.1398029Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.1398036Z 2025-12-04T09:41:43.1398040Z 2025-12-04T09:41:43.1398131Z if __name__ == "__main__": 2025-12-04T09:41:43.1398338Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.1398547Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.1398632Z From CHECK: .to( 2025-12-04T09:41:43.1398636Z 2025-12-04T09:41:43.1398643Z 2025-12-04T09:41:43.1398819Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.1399393Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1399399Z 2025-12-04T09:41:43.1399756Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.1399940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1400038Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.1400221Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1400619Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1402024Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1402114Z graph_break [] 2025-12-04T09:41:43.1402219Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1402398Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1402490Z Autotune Choices Stats: 2025-12-04T09:41:43.1403339Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1403443Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1403533Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1403644Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1404272Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1404749Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1405218Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1405682Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1406147Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1406625Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1407109Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1407591Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1408051Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1408575Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1408914Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.1409009Z Autotune Choices Stats: 2025-12-04T09:41:43.1409855Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1410008Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1410096Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1410203Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1410685Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1411149Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1411608Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1412061Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1412527Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1412986Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1413533Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1414002Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1414468Z triton_mm_55 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1414936Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1415271Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.1415444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1415544Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1415680Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1415930Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1416868Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1416957Z graph_break [] 2025-12-04T09:41:43.1417063Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1417235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1417371Z Autotune Choices Stats: 2025-12-04T09:41:43.1418207Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1418303Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1418394Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1418500Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.1418975Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1419497Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1419967Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1420451Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1420926Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1421408Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1421875Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1422344Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1422928Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1423394Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1423729Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.1423824Z Autotune Choices Stats: 2025-12-04T09:41:43.1424660Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1424756Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1424846Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1424957Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1425428Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1425894Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1426364Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1426825Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1427387Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1427849Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1428312Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1428879Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1429353Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1429828Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1430157Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.1430254Z Autotune Choices Stats: 2025-12-04T09:41:43.1431085Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.1431189Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1431277Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1431391Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1431871Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1432443Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.1432906Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1433374Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1433855Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1434342Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1434808Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1435284Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1435749Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1436220Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1436592Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.1436772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1436874Z frames [('total', 1), ('ok', 
1)] 2025-12-04T09:41:43.1437006Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1437252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1438246Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1438374Z graph_break [] 2025-12-04T09:41:43.1438485Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1438657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1438754Z Autotune Choices Stats: 2025-12-04T09:41:43.1439651Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1439745Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1439841Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1439948Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1440422Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1440891Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1441443Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1441909Z triton_mm_884 
0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1442379Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1442854Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1443320Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1443797Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1444264Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1444726Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1445061Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.1445155Z Autotune Choices Stats: 2025-12-04T09:41:43.1445977Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.1446122Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.1446210Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1446320Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1446786Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1447252Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1447760Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1448224Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1448699Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1449160Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1449625Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1450098Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1450656Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.1451136Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1451466Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.1451565Z Autotune Choices Stats: 2025-12-04T09:41:43.1452398Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1452498Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1452589Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1452700Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1453181Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1453653Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1454130Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1454597Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1455105Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1455574Z 
triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1456034Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1456498Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1457026Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1457525Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1457889Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.1458063Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1458162Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1458292Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1458539Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1459923Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.1460011Z graph_break [] 2025-12-04T09:41:43.1460210Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1460387Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1460482Z Autotune Choices Stats: 2025-12-04T09:41:43.1461341Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1461441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1461534Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1461639Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1462126Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1462596Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1463060Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1463525Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1463988Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1464499Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1464977Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1465446Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1465917Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1466421Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1466758Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.1466851Z Autotune Choices Stats: 2025-12-04T09:41:43.1467722Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1467825Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1467913Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1468023Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1468503Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1468969Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.1469535Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1470000Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1470467Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1470931Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1471408Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1471886Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1472359Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1472834Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1473165Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.1473344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1473483Z frames [('total', 1), ('ok', 
1)] 2025-12-04T09:41:43.1473615Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1473874Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1474817Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1474906Z graph_break [] 2025-12-04T09:41:43.1475011Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1475230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1475325Z Autotune Choices Stats: 2025-12-04T09:41:43.1476158Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.1476258Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1476349Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1476456Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1476939Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1477414Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1477885Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1478364Z 
triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1478915Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1479386Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1479897Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1480378Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1480855Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1481335Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1481666Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.1481760Z Autotune Choices Stats: 2025-12-04T09:41:43.1482590Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1482733Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.1482823Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1482928Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1483409Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1483886Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1484359Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1484883Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1485353Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1485822Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1486289Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1486760Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1487238Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1487712Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1488124Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.1488219Z Autotune Choices Stats: 2025-12-04T09:41:43.1489071Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1489176Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1489264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1489377Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1489862Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1490345Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1490814Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1491279Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1491758Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.1492224Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1492772Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1493249Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1493720Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1494247Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1494577Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.1494753Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1494849Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1494985Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1495236Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1496180Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1496269Z graph_break [] 2025-12-04T09:41:43.1496375Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1496547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1496645Z Autotune Choices Stats: 2025-12-04T09:41:43.1497612Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1497710Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1497800Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1497905Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1498385Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1498864Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1499344Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1499831Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1500436Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1500923Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.1501398Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1501951Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1502422Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1502896Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1503231Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.1503385Z Autotune Choices Stats: 2025-12-04T09:41:43.1504217Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1504315Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1504401Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1504512Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1504991Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1505471Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1505948Z 
triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1506425Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1507017Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1507503Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1508008Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1508485Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1508963Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1509440Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1509768Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.1509863Z Autotune Choices Stats: 2025-12-04T09:41:43.1510684Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1510781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1510910Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1511023Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1511507Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1511979Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1512451Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1512930Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1513459Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1513942Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1514426Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1514896Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1515370Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1515841Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1516255Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.1516429Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1516527Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1516658Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1516905Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1518330Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1518423Z graph_break [] 2025-12-04T09:41:43.1518528Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1518705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1518803Z Autotune Choices Stats: 2025-12-04T09:41:43.1519716Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1519816Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1519907Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1520016Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1520584Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1521194Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1521752Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1522321Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1522924Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1523494Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1524064Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1524637Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1525205Z triton_mm_1193 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1525763Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1526156Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.1526253Z Autotune Choices Stats: 2025-12-04T09:41:43.1527391Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1527514Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1527605Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1527719Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1528286Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1528850Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1529409Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1529959Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1530513Z triton_mm_1216 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1531064Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1531621Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1532227Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1532788Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1533346Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1533769Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.1533948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1534049Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1534187Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1534435Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1535829Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 
5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1535922Z graph_break [] 2025-12-04T09:41:43.1536032Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1536211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1536302Z Autotune Choices Stats: 2025-12-04T09:41:43.1537229Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1537338Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1537429Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1537539Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1538033Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1538549Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1539029Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1539508Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1539987Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.1540453Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1540941Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1541408Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1541924Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1542399Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1542738Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.1542835Z Autotune Choices Stats: 2025-12-04T09:41:43.1543717Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1543814Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1543910Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1544015Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1544505Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.1544980Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1545456Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1545935Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1546403Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1546951Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1547427Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1547906Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1548383Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1548860Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1549198Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.1549371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1549471Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1549604Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1549854Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1551231Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1551361Z graph_break [] 2025-12-04T09:41:43.1551471Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1551648Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1551741Z Autotune Choices Stats: 2025-12-04T09:41:43.1552584Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.1552725Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1552821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1552928Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1553409Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1553895Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1554371Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1554853Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1555333Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1555824Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1556381Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1556868Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1557376Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1557849Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1558190Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.1558286Z Autotune Choices Stats: 2025-12-04T09:41:43.1559143Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1559246Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1559339Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1559454Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1559993Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1560514Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1560997Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1561474Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1561948Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1562457Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1562936Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1563415Z 
triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1563887Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1564368Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1564706Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.1564889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1564986Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1565118Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1565472Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1566862Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1566954Z graph_break [] 2025-12-04T09:41:43.1567062Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1567238Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1567339Z Autotune Choices Stats: 2025-12-04T09:41:43.1568223Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1568323Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1568413Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1568520Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1569003Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1569481Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1570003Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1570488Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1570968Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1571453Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1571979Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1572460Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1572936Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1573414Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1573747Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.1573843Z Autotune Choices Stats: 2025-12-04T09:41:43.1574684Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1574782Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1574957Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1575064Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1575542Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1576018Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1576486Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1576970Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1577448Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1577919Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1578387Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1578867Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1579452Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1579931Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1580269Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.1580444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1580539Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1580719Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1580967Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1582346Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1582430Z graph_break [] 2025-12-04T09:41:43.1582538Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1582716Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1582808Z Autotune Choices Stats: 2025-12-04T09:41:43.1583658Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1583758Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1583846Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1583959Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1584533Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1585020Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1585489Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1585960Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1586440Z triton_mm_1357 0.0276 ms 
96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1586913Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1587390Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1587909Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1588384Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1588903Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1589239Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.1589338Z Autotune Choices Stats: 2025-12-04T09:41:43.1590183Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1590325Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1590414Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1590520Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1591002Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1591481Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1591958Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1592425Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1592892Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1593363Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1593913Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1594392Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1594865Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1595343Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.1595674Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.1595849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1595949Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1596086Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1596338Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1597284Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1597370Z graph_break [] 2025-12-04T09:41:43.1597480Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1597678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1597835Z Autotune Choices Stats: 2025-12-04T09:41:43.1598674Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1598771Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1598865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1598972Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1599450Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1600051Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1600650Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1601132Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1601606Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1602084Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1602565Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1603047Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1603663Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1604138Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1604479Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.1604577Z Autotune Choices Stats: 2025-12-04T09:41:43.1605410Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1605508Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1605598Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1605716Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1606193Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1606665Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1607144Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1607666Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1608202Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1608670Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1609143Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1609669Z triton_mm_1432 0.0338 ms 78.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1610145Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1610627Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1610958Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.1611059Z Autotune Choices Stats: 2025-12-04T09:41:43.1611891Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1611992Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1612083Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1612196Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1612768Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1613245Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1613722Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1614206Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1614684Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1615172Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1615650Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1616135Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1616620Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1617199Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1617532Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.1617705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1617806Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1617938Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1618184Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1619133Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1619262Z graph_break [] 2025-12-04T09:41:43.1619372Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1619548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1619639Z Autotune Choices Stats: 2025-12-04T09:41:43.1620471Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1620566Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1620657Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1620765Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1621248Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1621737Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1622306Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1622802Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1623270Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1623750Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1624221Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1624692Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1625173Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1625642Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1625981Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.1626117Z Autotune Choices Stats: 2025-12-04T09:41:43.1626953Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1627054Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1627144Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1627264Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1627781Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1628296Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1628768Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1629239Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1629710Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1630178Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1630649Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1631124Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1631677Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1632155Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1632489Z SingleProcess 
AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.1632591Z Autotune Choices Stats: 2025-12-04T09:41:43.1633410Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1633509Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1633604Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1633717Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1634199Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1634671Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1635147Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1635633Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1636179Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1636664Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1637151Z triton_mm_1550 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1637676Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1638144Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1638617Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1638958Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.1639131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1639233Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1639367Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1639673Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1640619Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1640707Z graph_break [] 2025-12-04T09:41:43.1640817Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1641086Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1641179Z Autotune Choices Stats: 2025-12-04T09:41:43.1642022Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1642119Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1642212Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1642319Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1642803Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1643289Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1643764Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1644244Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1644731Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1645248Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1645725Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1646201Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1646674Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1647183Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1647569Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.1647671Z Autotune Choices Stats: 2025-12-04T09:41:43.1648501Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1648601Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1648689Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1648802Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1649280Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1649764Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1650317Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1650786Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1651259Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1651726Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1652192Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1652688Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1653162Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1653641Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1653974Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.1654071Z Autotune Choices Stats: 2025-12-04T09:41:43.1654910Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1655053Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1655142Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1655255Z dtypes: 
torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1655744Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1656214Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1656728Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1657217Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1657683Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1658207Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1658686Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1659172Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1659723Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1660202Z triton_mm_1637 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1660536Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.1660709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1660816Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1664359Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1664631Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1666019Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1666104Z graph_break [] 2025-12-04T09:41:43.1666214Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1666393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1666491Z Autotune Choices Stats: 2025-12-04T09:41:43.1667333Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1667494Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1667588Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1667698Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1668243Z triton_mm_1656 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1668806Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1669372Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1669902Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1670378Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1670847Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1671324Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1671793Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1672272Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1672857Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1673192Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.1673287Z Autotune Choices Stats: 2025-12-04T09:41:43.1674126Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1674228Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1674316Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1674425Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1674908Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1675380Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1675851Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1676317Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1676789Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1677307Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.1677825Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1678300Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1678772Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1679290Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1679696Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.1679878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1679974Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1680107Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1680355Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1681729Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1681821Z graph_break [] 2025-12-04T09:41:43.1681926Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1682100Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1682283Z Autotune Choices Stats: 2025-12-04T09:41:43.1683126Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1683224Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1683312Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1683421Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1683906Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1684378Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1684853Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1685319Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1685787Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1686264Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1686779Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1687257Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1687763Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1688251Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1688624Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.1688721Z Autotune Choices Stats: 2025-12-04T09:41:43.1689557Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1689652Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1689743Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1689848Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1690322Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1690798Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1691265Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1691823Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1692291Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1692756Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1693229Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1693704Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1694181Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1694653Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1694989Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.1695165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1695260Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1695394Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1695683Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1697070Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1697157Z graph_break [] 2025-12-04T09:41:43.1697263Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1697479Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1697572Z Autotune Choices Stats: 2025-12-04T09:41:43.1698431Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1698527Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1698618Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1698727Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1699213Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1699696Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1700188Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.1700826Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1701459Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1701936Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1702415Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1702891Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1703366Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1703849Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1704180Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.1704278Z Autotune Choices Stats: 2025-12-04T09:41:43.1705117Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.1705277Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1705364Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1705471Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1705967Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1706437Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1706908Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1707486Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1707958Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1708439Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1708908Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1709387Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1709867Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1710351Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1710789Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.1710970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1711070Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1711203Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1711450Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1712840Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1712930Z graph_break [] 2025-12-04T09:41:43.1713039Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1713224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1713316Z Autotune Choices Stats: 2025-12-04T09:41:43.1714159Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1714257Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1714343Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.1714449Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1714930Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1715470Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1715948Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1716431Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1716955Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1717449Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1717974Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1718449Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1718920Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1719390Z 
triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1719787Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.1719883Z Autotune Choices Stats: 2025-12-04T09:41:43.1720819Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1720917Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1721005Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1721113Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1721599Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1722076Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1722562Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1723036Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1723509Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1723979Z triton_mm_1817 0.0307 ms 93.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1724453Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1724983Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1725460Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1725941Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1726314Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.1726492Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1726593Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1726725Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1726975Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1727918Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1728005Z graph_break [] 2025-12-04T09:41:43.1728109Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1728283Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.1728381Z Autotune Choices Stats: 2025-12-04T09:41:43.1729226Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.1729328Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1729560Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1729669Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1730157Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1730627Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1731102Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1731582Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1732059Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1732542Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1733012Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1733499Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1734018Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1734507Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1734835Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.1734929Z Autotune Choices Stats: 2025-12-04T09:41:43.1735764Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1735898Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1735991Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1736097Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1736581Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1737083Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1737583Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1738061Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1738530Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1739086Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1739557Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1740033Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1740514Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1740995Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1741334Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.1741429Z Autotune Choices Stats: 2025-12-04T09:41:43.1742276Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1742375Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1742465Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1742579Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1743054Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1743585Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1744061Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1744533Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1745086Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1745565Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1746052Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1746537Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1747010Z triton_mm_1888 0.0286 ms 96.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1747485Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1747852Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.1748047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1748143Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1748352Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1748600Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1749973Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1750063Z graph_break [] 2025-12-04T09:41:43.1750168Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1750346Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1750442Z Autotune Choices Stats: 2025-12-04T09:41:43.1751290Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1751387Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.1751474Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1751580Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1752068Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1752546Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1753069Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1753546Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1754023Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1754541Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1755015Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1755493Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1755968Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1756446Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1756778Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.1756870Z Autotune Choices Stats: 2025-12-04T09:41:43.1757716Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1757891Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1757982Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1758087Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1758572Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1759060Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1759594Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1760068Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1760533Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.1760999Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1761464Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1761938Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1762457Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1762927Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1763257Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.1763428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1763564Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1763698Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1763939Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1765314Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1765395Z graph_break [] 2025-12-04T09:41:43.1765496Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1765674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1765766Z Autotune Choices Stats: 2025-12-04T09:41:43.1766616Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1766712Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1766797Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1766930Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1767515Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1768005Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1768472Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1768951Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1769435Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1769901Z triton_mm_1954 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1770372Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1770847Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1771315Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1771826Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1772155Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.1772249Z Autotune Choices Stats: 2025-12-04T09:41:43.1773084Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1773228Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1773317Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1773422Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1773903Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1774374Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1774843Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1775308Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1775776Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1776243Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1776791Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1777267Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1777762Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1778262Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1778595Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.1778819Z ______________ 
TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.1778925Z Traceback (most recent call last): 2025-12-04T09:41:43.1779341Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1779526Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.1779871Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.1780054Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.1780218Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.1780303Z Searched string: 2025-12-04T09:41:43.1780504Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.1780509Z 2025-12-04T09:41:43.1780630Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.1780635Z 2025-12-04T09:41:43.1780770Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1780898Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.1780902Z 2025-12-04T09:41:43.1780996Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.1781085Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.1781182Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1781274Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.1781278Z 2025-12-04T09:41:43.1781366Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.1781499Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.1781591Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1781683Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.1781687Z 2025-12-04T09:41:43.1781691Z 2025-12-04T09:41:43.1781852Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.1781857Z 2025-12-04T09:41:43.1781860Z 2025-12-04T09:41:43.1781979Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.1782099Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.1782212Z rn = pid_n * BLOCK_N 
+ tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.1782299Z idx_m = rm[:, None] 2025-12-04T09:41:43.1782385Z idx_n = rn[None, :] 2025-12-04T09:41:43.1782478Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.1782482Z 2025-12-04T09:41:43.1782583Z # inductor generates a suffix 2025-12-04T09:41:43.1782673Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.1782883Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.1782979Z ''', device_str='cuda') 2025-12-04T09:41:43.1782984Z 2025-12-04T09:41:43.1782987Z 2025-12-04T09:41:43.1783085Z async_compile.wait(globals()) 2025-12-04T09:41:43.1783172Z del async_compile 2025-12-04T09:41:43.1783179Z 2025-12-04T09:41:43.1783256Z class Runner: 2025-12-04T09:41:43.1783357Z def __init__(self, partitions): 2025-12-04T09:41:43.1783461Z self.partitions = partitions 2025-12-04T09:41:43.1783466Z 2025-12-04T09:41:43.1783658Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.1783750Z new_callables = [] 2025-12-04T09:41:43.1783869Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.1783974Z new_callables.append(fn(c)) 2025-12-04T09:41:43.1784079Z self.partitions = new_callables 2025-12-04T09:41:43.1784083Z 2025-12-04T09:41:43.1784172Z def call(self, args): 2025-12-04T09:41:43.1784261Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.1784346Z args.clear() 2025-12-04T09:41:43.1784476Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1784601Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.1784711Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.1784810Z torch.cuda.set_device(0) 2025-12-04T09:41:43.1784975Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1785194Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.1785295Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1785487Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.1785571Z 
del arg0_1 2025-12-04T09:41:43.1785732Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.1785987Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.1786086Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.1786304Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.1786390Z del arg1_1 2025-12-04T09:41:43.1786468Z del buf0 2025-12-04T09:41:43.1786599Z return (buf1, ) 2025-12-04T09:41:43.1786603Z 2025-12-04T09:41:43.1786703Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.1786786Z call = runner.call 2025-12-04T09:41:43.1786947Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.1786959Z 2025-12-04T09:41:43.1786963Z 2025-12-04T09:41:43.1787100Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.1787231Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.1787381Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.1787580Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.1787782Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.1787928Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.1788090Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.1788095Z 2025-12-04T09:41:43.1788101Z 2025-12-04T09:41:43.1788200Z if __name__ == "__main__": 2025-12-04T09:41:43.1788401Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.1788560Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.1788656Z From CHECK: .to( 2025-12-04T09:41:43.1788661Z 2025-12-04T09:41:43.1788664Z 2025-12-04T09:41:43.1788839Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.1789401Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python 
test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.1789406Z 2025-12-04T09:41:43.1789623Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.1789809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1789904Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1790039Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1790295Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1791754Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1791847Z graph_break [] 2025-12-04T09:41:43.1791953Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1792127Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1792230Z Autotune Choices Stats: 2025-12-04T09:41:43.1793072Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1793174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1793264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1793372Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1793859Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1794316Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1794782Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1795286Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1795746Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1796220Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1796678Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1797236Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1797695Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1798163Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1798501Z SingleProcess 
AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.1798594Z Autotune Choices Stats: 2025-12-04T09:41:43.1799428Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1799600Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1799689Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1799796Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1800525Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1800995Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1801450Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1801914Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1802369Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1802834Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1803292Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1803756Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1804226Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1804690Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1805073Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.1805257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1805351Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1805489Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1805735Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1806677Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1806825Z graph_break [] 2025-12-04T09:41:43.1806930Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1807111Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1807222Z Autotune Choices Stats: 2025-12-04T09:41:43.1808092Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1808188Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1808273Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1808376Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1808854Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1809332Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1809921Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1810399Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1810881Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1811363Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1811832Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1812309Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1812776Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1813239Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1813571Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.1813663Z Autotune Choices Stats: 2025-12-04T09:41:43.1814534Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1814627Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1814717Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1814819Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1815291Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1815751Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1816282Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1816760Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1817221Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1817681Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1818145Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1818616Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1819165Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1819636Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1819969Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.1820059Z Autotune Choices Stats: 2025-12-04T09:41:43.1820895Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.1820990Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1821074Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1821190Z dtypes: torch.bfloat16, 
torch.bfloat16 2025-12-04T09:41:43.1821672Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1822143Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1822605Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1823075Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1823586Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1824059Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1824527Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1824993Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1825513Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1825972Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1826299Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.1826477Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1826570Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1826708Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1826955Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1827944Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1828031Z graph_break [] 2025-12-04T09:41:43.1828134Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1828388Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1828485Z Autotune Choices Stats: 2025-12-04T09:41:43.1829316Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1829413Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1829497Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1829600Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1830082Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1830549Z triton_mm_879 0.0276 
ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1831012Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1831472Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1831956Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1832424Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1832945Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1833428Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1833889Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1834402Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1834741Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.1834835Z Autotune 
Choices Stats: 2025-12-04T09:41:43.1835667Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.1835761Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1835849Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1835953Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1836424Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1836896Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1837392Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1837961Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1838422Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1838884Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1839352Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.1839869Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1840344Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1840811Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1841142Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.1841236Z Autotune Choices Stats: 2025-12-04T09:41:43.1842073Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1842211Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1842301Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1842414Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1842891Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1843364Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1843886Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1844353Z 
triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1844826Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1845288Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1845755Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1846214Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1846676Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1847254Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1847583Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.1847760Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1847854Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1847986Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1848238Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1849609Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1849698Z graph_break [] 2025-12-04T09:41:43.1849801Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1849973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1850066Z Autotune Choices Stats: 2025-12-04T09:41:43.1850905Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1851044Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1851129Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1851232Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1851719Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1852182Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1852648Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1853150Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1853612Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1854087Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1854554Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1855026Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1855492Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1855956Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1856365Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.1856458Z Autotune Choices Stats: 2025-12-04T09:41:43.1857361Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1857459Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1857550Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1857652Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.1858129Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1858615Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1859089Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1859555Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1860020Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1860485Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1861001Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1861473Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1861949Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1862463Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1862800Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.1862974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1863073Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1863206Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1863449Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1864389Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1864473Z graph_break [] 2025-12-04T09:41:43.1864576Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1864752Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1864845Z Autotune Choices Stats: 2025-12-04T09:41:43.1865780Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.1865875Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1865959Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1866067Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1866547Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1867025Z 
triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1867508Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1868013Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1868480Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1868947Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1869415Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1869928Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1870411Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1870882Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1871211Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 
2025-12-04T09:41:43.1871345Z Autotune Choices Stats: 2025-12-04T09:41:43.1872171Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1872269Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1872354Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1872462Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1872939Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1873410Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1873888Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1874353Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1874955Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1875427Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1875891Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1876369Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1876847Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1877324Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1877650Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.1877740Z Autotune Choices Stats: 2025-12-04T09:41:43.1878581Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1878675Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1878807Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1878920Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1879415Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1879946Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1880411Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.1880947Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1881419Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1881898Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1882370Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1882845Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1883327Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1883803Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1884220Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.1884399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1884491Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1884625Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1884872Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 
1)] 2025-12-04T09:41:43.1885817Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1885905Z graph_break [] 2025-12-04T09:41:43.1886013Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1886191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1886280Z Autotune Choices Stats: 2025-12-04T09:41:43.1887109Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1887208Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1887294Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1887424Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1887934Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1888406Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1888930Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1889407Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1889887Z 
triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1890408Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1890893Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1891383Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1891851Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1892330Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1892660Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.1892759Z Autotune Choices Stats: 2025-12-04T09:41:43.1893673Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.1893773Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1893862Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1893968Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1894448Z triton_mm_1124 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1894926Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1895399Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1895884Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1896360Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1896841Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1897333Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1897840Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1898361Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1898839Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1899175Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.1899311Z Autotune Choices Stats: 2025-12-04T09:41:43.1900143Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1900237Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1900599Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1900713Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.1901191Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1901672Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1902145Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1902630Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1903242Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1903810Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.1904385Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1904939Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1905498Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1906059Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1906447Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.1906642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1906740Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1906882Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1907166Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1908888Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1909032Z graph_break [] 2025-12-04T09:41:43.1909138Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1909336Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1909431Z Autotune Choices Stats: 2025-12-04T09:41:43.1910425Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1910585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1910675Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1910788Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1911354Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1911911Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1912473Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1913035Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1913606Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1914251Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1914823Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1915392Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.1915963Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1916524Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1916915Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.1917012Z Autotune Choices Stats: 2025-12-04T09:41:43.1918060Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1918160Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1918251Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1918357Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1918926Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1919602Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1920160Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1920715Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1921264Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1921858Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1922415Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1922974Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1923529Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1924088Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1924472Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.1924670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1924778Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1924999Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1925281Z 
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1926999Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1927086Z graph_break [] 2025-12-04T09:41:43.1927194Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1927394Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1927485Z Autotune Choices Stats: 2025-12-04T09:41:43.1928554Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1928650Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1932145Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1932270Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1932773Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1933259Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1933811Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.1934287Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1934754Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1935224Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1935741Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1936213Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1936688Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1937160Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1937495Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.1937589Z Autotune Choices Stats: 2025-12-04T09:41:43.1938426Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.1938613Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1938704Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1938815Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1939297Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1939773Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1940265Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1940748Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1941223Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1941689Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1942159Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1942636Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1943153Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1943635Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1943965Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.1944144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1944283Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1944417Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1944667Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1946046Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1946136Z graph_break [] 2025-12-04T09:41:43.1946243Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1946418Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1946515Z Autotune Choices Stats: 2025-12-04T09:41:43.1947352Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.1947469Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1947567Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.1947694Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1948259Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1948735Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1949216Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1949693Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1950175Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1950661Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1951138Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1951608Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1952077Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1952593Z triton_mm_1270 
0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1952923Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.1953021Z Autotune Choices Stats: 2025-12-04T09:41:43.1953860Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1953996Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1954088Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1954195Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1954682Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1955162Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1955637Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1956116Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1956585Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1957056Z triton_mm_1296 0.0317 ms 87.1% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1957637Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1958156Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1958632Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1959107Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1959441Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.1959683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1959783Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1959918Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1960164Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1961538Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1961670Z 
graph_break [] 2025-12-04T09:41:43.1961775Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1961952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1962045Z Autotune Choices Stats: 2025-12-04T09:41:43.1962887Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1962981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1963068Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1963218Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1963696Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1964176Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1964651Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1965129Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1965612Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1966092Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1966582Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1967137Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1967613Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1968078Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1968416Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.1968517Z Autotune Choices Stats: 2025-12-04T09:41:43.1969349Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.1969446Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1969533Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1969637Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1970114Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1970584Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.1971056Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1971576Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1972043Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1972513Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1973038Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1973524Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1974004Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1974479Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1974811Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.1974988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1975088Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.1975223Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1975475Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1976929Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1977013Z graph_break [] 2025-12-04T09:41:43.1977124Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1977299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1977398Z Autotune Choices Stats: 2025-12-04T09:41:43.1978292Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1978391Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1978483Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1978595Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.1979082Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1979558Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1980030Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1980540Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1981017Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1981486Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1981959Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1982478Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1982952Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1983428Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1983760Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.1983854Z Autotune Choices Stats: 2025-12-04T09:41:43.1984693Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.1984789Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1984880Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1984987Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.1985547Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1986035Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1986507Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1986974Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.1987449Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1987971Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1988449Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1988921Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1989399Z 
triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1989911Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1990247Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.1990425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.1990521Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.1990656Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.1990901Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.1991840Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.1991972Z graph_break [] 2025-12-04T09:41:43.1992075Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.1992257Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.1992350Z Autotune Choices Stats: 2025-12-04T09:41:43.1993183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.1993282Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1993370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.1993474Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.1993968Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.1994438Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1995015Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.1995484Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.1995959Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1996431Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1996906Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.1997394Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1997872Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1998343Z triton_mm_1402 0.0278 ms 99.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.1998677Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.1998775Z Autotune Choices Stats: 2025-12-04T09:41:43.1999716Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.1999814Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.1999904Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2000009Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2000658Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2001221Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2001692Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2002172Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2002644Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2003115Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2003585Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2004064Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2004656Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2005135Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2005473Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.2005569Z Autotune Choices Stats: 2025-12-04T09:41:43.2006427Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2006525Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2006617Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2006731Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2007229Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2007752Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2008226Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2008712Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2009259Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2009738Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2010219Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2010738Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2011227Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2011718Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2012046Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.2012224Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.2012319Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2012452Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2012706Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2013650Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2013737Z graph_break [] 2025-12-04T09:41:43.2013844Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2014102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2014198Z Autotune Choices Stats: 2025-12-04T09:41:43.2015034Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2015132Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2015219Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2015325Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2015803Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2016295Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2016779Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2017262Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2017791Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2018311Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2018786Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2019258Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2019726Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2020239Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2020577Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.2020669Z Autotune Choices Stats: 2025-12-04T09:41:43.2021507Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2021601Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2021692Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2021799Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2022276Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2022757Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2023311Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2023787Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2024256Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2024730Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2025202Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2025685Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2026165Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2026641Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2026980Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.2027075Z Autotune Choices Stats: 2025-12-04T09:41:43.2027911Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2028055Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2028142Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2028256Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2028734Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2029207Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2029810Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2030302Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2030784Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2031261Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2031746Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2032228Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2032778Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2033250Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2033583Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.2033763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2033861Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2033992Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2034241Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2035191Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.2035279Z graph_break [] 2025-12-04T09:41:43.2035384Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2035557Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2035653Z Autotune Choices Stats: 2025-12-04T09:41:43.2036492Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2036593Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2036723Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2036829Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2037371Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2037848Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2038325Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2038850Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2039337Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2039908Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2040378Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2040855Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2041325Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2041799Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2042234Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.2042330Z Autotune Choices Stats: 2025-12-04T09:41:43.2043163Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2043262Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2043350Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2043455Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2043930Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2044412Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2044881Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2045354Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2045824Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2046292Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2046818Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2047299Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2047821Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2048343Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2048678Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.2048774Z Autotune Choices Stats: 2025-12-04T09:41:43.2049635Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2049734Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2049821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2049935Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2050421Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2050896Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2051371Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2051922Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2052397Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2052866Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2053345Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2053825Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2054299Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2054782Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2055114Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.2055291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2055386Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2055561Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2055814Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2057204Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2057291Z graph_break [] 2025-12-04T09:41:43.2057441Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2057614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2057709Z Autotune Choices Stats: 2025-12-04T09:41:43.2058542Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2058647Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2058733Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2058838Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2059321Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2059795Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2060280Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2060755Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2061301Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2061772Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2062247Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2062719Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.2063203Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2063679Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2064010Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.2064104Z Autotune Choices Stats: 2025-12-04T09:41:43.2064963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2065124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2065214Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2065320Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2065805Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2066275Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2066742Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2067291Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2067771Z 
triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2068241Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2068720Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2069194Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2069671Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2070146Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2070562Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.2070739Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2070834Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2070969Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2071218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2072604Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2072690Z graph_break [] 2025-12-04T09:41:43.2072800Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2072979Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2073072Z Autotune Choices Stats: 2025-12-04T09:41:43.2073923Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2074025Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2074113Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2074222Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2074755Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2075239Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2075711Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2076179Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2076700Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2077178Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2077686Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2078185Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2078658Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2079141Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2079542Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.2079642Z Autotune Choices Stats: 2025-12-04T09:41:43.2080570Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2080673Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2080761Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2080866Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2081350Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2081824Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2082309Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2082786Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2083254Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2083730Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2084236Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2084729Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2085208Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2085687Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2086058Z SingleProcess AUTOTUNE benchmarking 
takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.2086236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2086337Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2086470Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2086724Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2088100Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2088189Z graph_break [] 2025-12-04T09:41:43.2088303Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2088481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2088583Z Autotune Choices Stats: 2025-12-04T09:41:43.2089513Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2089609Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2089699Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2089805Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2090294Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2090777Z triton_mm_1748 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2091261Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2091751Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2092234Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2092715Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2093199Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2093714Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2094202Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2094671Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2095013Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.2095152Z 
Autotune Choices Stats: 2025-12-04T09:41:43.2095996Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2096095Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2096186Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2096301Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2096782Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2097310Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2097783Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2098253Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2098835Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2099305Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2099777Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2100556Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2101046Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2101527Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2101862Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.2102042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2102138Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2102277Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2102525Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2103907Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2104073Z graph_break [] 2025-12-04T09:41:43.2104178Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2104359Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2104451Z Autotune Choices Stats: 2025-12-04T09:41:43.2105279Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2105436Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2105529Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2105637Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2106123Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2106610Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2107112Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2107626Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2108111Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2108704Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2109176Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2109654Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2110129Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2110607Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2110950Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.2111053Z Autotune Choices Stats: 2025-12-04T09:41:43.2111894Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2111991Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2112083Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2112191Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2112676Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2113201Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2113677Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2114157Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2114667Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2115142Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2115627Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2116109Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2116585Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2117063Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2117403Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.2117579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2117679Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2117893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2118144Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2119094Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2119180Z graph_break [] 2025-12-04T09:41:43.2119288Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2119464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2119616Z Autotune Choices Stats: 2025-12-04T09:41:43.2120485Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.2120583Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2120675Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2120785Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2121273Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2121749Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2122576Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2123058Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2123532Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2124014Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2124530Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2125006Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2125485Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2125960Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2126293Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.2126393Z Autotune Choices Stats: 2025-12-04T09:41:43.2127234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2127365Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2127466Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2127662Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2128154Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.2128634Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2129110Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2129582Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2130068Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2130535Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2131003Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2131485Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2132000Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2132492Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2132826Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds 
and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.2132922Z Autotune Choices Stats: 2025-12-04T09:41:43.2133753Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2133890Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2133983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2134098Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2134574Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2135055Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2135529Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2136007Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2136489Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2137080Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2137612Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2138096Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2138578Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2139049Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2139387Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.2139568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2139663Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2139799Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2140051Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2141439Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2141568Z graph_break [] 2025-12-04T09:41:43.2141677Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2141857Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2141956Z Autotune Choices Stats: 
2025-12-04T09:41:43.2142816Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2142914Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2143046Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2143157Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2143643Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2144131Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2144615Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2145093Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2145581Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2146051Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2146612Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.2147084Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2147564Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2148043Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2148379Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.2148481Z Autotune Choices Stats: 2025-12-04T09:41:43.2149338Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2149441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2149530Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2149636Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2150122Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2150600Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2151121Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2151594Z 
triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2152066Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2152534Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2153045Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2153544Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2154017Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2154495Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2154833Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.2155016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2155118Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2155252Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2155503Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2156969Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2157072Z graph_break [] 2025-12-04T09:41:43.2157201Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2157382Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2157480Z Autotune Choices Stats: 2025-12-04T09:41:43.2158322Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2158428Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2158518Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2158627Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2159120Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2159690Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2160163Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2160687Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2161170Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2161651Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2162121Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2162640Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2163116Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2163594Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2163928Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.2164022Z Autotune Choices Stats: 2025-12-04T09:41:43.2164863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2164963Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2165051Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2165163Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2165724Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2166197Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2166665Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2167135Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2167642Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2168136Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2168610Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2169086Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2169571Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2170083Z triton_mm_1992 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2170418Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.2170599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2170696Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2170839Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2171086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2172545Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2172636Z graph_break [] 2025-12-04T09:41:43.2172744Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2172927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2173018Z Autotune Choices Stats: 2025-12-04T09:41:43.2173856Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2173958Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2174046Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2174154Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2174635Z triton_mm_1997 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2175205Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2175694Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2176174Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2176668Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2177142Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2177621Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2178105Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2178584Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2179058Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2179498Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.2179597Z Autotune Choices Stats: 2025-12-04T09:41:43.2180443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2180539Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2180634Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2180739Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2181269Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2181748Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2182221Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2182705Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2183174Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2183658Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2184129Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2184688Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2185167Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2185641Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2185978Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.2186206Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.2186317Z Traceback (most recent call last): 2025-12-04T09:41:43.2186731Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.2186917Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.2187311Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.2187500Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.2187670Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.2187758Z Searched string: 2025-12-04T09:41:43.2187895Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.2187901Z 2025-12-04T09:41:43.2188024Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.2188029Z 
2025-12-04T09:41:43.2188160Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.2188329Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.2188338Z 2025-12-04T09:41:43.2188431Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.2188521Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.2188626Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2188720Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.2188724Z 2025-12-04T09:41:43.2188814Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.2188912Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.2189006Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2189096Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.2189100Z 2025-12-04T09:41:43.2189110Z 2025-12-04T09:41:43.2189315Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.2189320Z 2025-12-04T09:41:43.2189324Z 2025-12-04T09:41:43.2189444Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.2189565Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.2189683Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.2189771Z idx_m = rm[:, None] 2025-12-04T09:41:43.2189861Z idx_n = rn[None, :] 2025-12-04T09:41:43.2189956Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.2189964Z 2025-12-04T09:41:43.2190068Z # inductor generates a suffix 2025-12-04T09:41:43.2190159Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2190372Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.2190467Z ''', device_str='cuda') 2025-12-04T09:41:43.2190471Z 2025-12-04T09:41:43.2190475Z 2025-12-04T09:41:43.2190574Z async_compile.wait(globals()) 2025-12-04T09:41:43.2190658Z del async_compile 2025-12-04T09:41:43.2190665Z 2025-12-04T09:41:43.2190750Z class Runner: 2025-12-04T09:41:43.2190854Z def __init__(self, partitions): 2025-12-04T09:41:43.2190962Z self.partitions = partitions 2025-12-04T09:41:43.2190967Z 2025-12-04T09:41:43.2191080Z def recursively_apply_fns(self, fns): 
2025-12-04T09:41:43.2191171Z new_callables = [] 2025-12-04T09:41:43.2191294Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.2191400Z new_callables.append(fn(c)) 2025-12-04T09:41:43.2191594Z self.partitions = new_callables 2025-12-04T09:41:43.2191599Z 2025-12-04T09:41:43.2191694Z def call(self, args): 2025-12-04T09:41:43.2191784Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.2191867Z args.clear() 2025-12-04T09:41:43.2192000Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.2192126Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.2192242Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.2192342Z torch.cuda.set_device(0) 2025-12-04T09:41:43.2192509Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.2192736Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.2192836Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.2193025Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.2193111Z del arg0_1 2025-12-04T09:41:43.2193276Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.2193534Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.2193630Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.2193845Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.2193930Z del arg1_1 2025-12-04T09:41:43.2194008Z del buf0 2025-12-04T09:41:43.2194096Z return (buf1, ) 2025-12-04T09:41:43.2194100Z 2025-12-04T09:41:43.2194204Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.2194288Z call = runner.call 2025-12-04T09:41:43.2194445Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.2194502Z 2025-12-04T09:41:43.2194506Z 2025-12-04T09:41:43.2194650Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.2194778Z from torch._dynamo.testing import 
rand_strided 2025-12-04T09:41:43.2194928Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.2195129Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.2195328Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.2195431Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.2195591Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.2195596Z 2025-12-04T09:41:43.2195641Z 2025-12-04T09:41:43.2195737Z if __name__ == "__main__": 2025-12-04T09:41:43.2195934Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.2199651Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.2199756Z From CHECK: .to( 2025-12-04T09:41:43.2199761Z 2025-12-04T09:41:43.2199765Z 2025-12-04T09:41:43.2199945Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.2200653Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.2200659Z 2025-12-04T09:41:43.2200877Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.2201055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2201153Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2201282Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2201533Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2203074Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), 
('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2203162Z graph_break [] 2025-12-04T09:41:43.2203270Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2203445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2203538Z Autotune Choices Stats: 2025-12-04T09:41:43.2204382Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2204477Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2204569Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2204675Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2205162Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2205628Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2206100Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2206567Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2207022Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.2207600Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2208056Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2208511Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2209039Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2209505Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2209840Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.2209931Z Autotune Choices Stats: 2025-12-04T09:41:43.2210761Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2210853Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2210942Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2211049Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2211524Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2211985Z triton_mm_48 0.0277 
ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2212524Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2212980Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2213440Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2213897Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2214358Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2214822Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2215285Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2215751Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2216084Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.2216303Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2216397Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2216525Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2216785Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2217777Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2217902Z graph_break [] 2025-12-04T09:41:43.2218006Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2218176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2218268Z Autotune Choices Stats: 2025-12-04T09:41:43.2219090Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2219187Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2219272Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2219374Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2219848Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2220314Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2220790Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2221351Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2221828Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2222312Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2222778Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2223252Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2223725Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2224188Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2224515Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.2224606Z Autotune Choices Stats: 2025-12-04T09:41:43.2225435Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2225569Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2225657Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2225760Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2226236Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2226700Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2227170Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2227680Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2228197Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2228659Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2229122Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2229590Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2230065Z triton_mm_832 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2230618Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2230953Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.2231043Z Autotune Choices Stats: 2025-12-04T09:41:43.2231867Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.2231972Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2232056Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2232168Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2232646Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2233107Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2233568Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2234035Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2234513Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2235022Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2235492Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2235964Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2236426Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2236929Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2237263Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.2237441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2237534Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2237662Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2237912Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2238850Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2238942Z graph_break [] 2025-12-04T09:41:43.2239046Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2239221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2239312Z Autotune Choices Stats: 2025-12-04T09:41:43.2240329Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2240427Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2240513Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2240615Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2241098Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2241565Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2242027Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2242497Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2242967Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2243441Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2243909Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2244433Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2244892Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2245353Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2245685Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.2245820Z Autotune Choices Stats: 2025-12-04T09:41:43.2246645Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.2246739Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2246827Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2246950Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2247453Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2247930Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2248392Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2248859Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2249403Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2249864Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2250326Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2250795Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2251267Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2251739Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2252071Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.2252162Z Autotune Choices Stats: 2025-12-04T09:41:43.2252995Z {"num_choices": 15, "num_triton_choices": 15, 
"best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2253092Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2253218Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2253328Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2253808Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2254279Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2254757Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2255273Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2255746Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2256213Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2256672Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2257134Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2257627Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2258116Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2258527Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.2258704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2258797Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2258925Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2259171Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2260544Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2260634Z graph_break [] 2025-12-04T09:41:43.2260736Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2260912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2261004Z Autotune Choices Stats: 2025-12-04T09:41:43.2261842Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2261937Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2262024Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2262127Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2262608Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2263118Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2263576Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2264036Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2264539Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2265011Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2265481Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2265952Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.2266420Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2266883Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2267222Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.2267313Z Autotune Choices Stats: 2025-12-04T09:41:43.2268219Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2268313Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2268398Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2268504Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2268980Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2269452Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2269931Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2270393Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2270859Z 
triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2271331Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2271807Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2272321Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2272795Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2273264Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2273632Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.2273809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2273905Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2274037Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2274280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2275221Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), 
('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2275306Z graph_break [] 2025-12-04T09:41:43.2275409Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2275584Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2275675Z Autotune Choices Stats: 2025-12-04T09:41:43.2276508Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.2276606Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2276691Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2276898Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2277400Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2277907Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2278383Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2278849Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2279321Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2279837Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2280303Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2280777Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2281291Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2281769Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2282097Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.2282190Z Autotune Choices Stats: 2025-12-04T09:41:43.2283015Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2283148Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2283239Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2283341Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2283821Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2284291Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2284762Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2285230Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2285696Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2286239Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2286705Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2287178Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2287690Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2288173Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2288503Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.2288597Z Autotune Choices Stats: 2025-12-04T09:41:43.2289434Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2289526Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2289612Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2289723Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2290202Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2290721Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2291188Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2291651Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2292127Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2292631Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2293104Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2293574Z 
triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2294045Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2294512Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2294843Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.2295021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2295113Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2295322Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2295566Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2296504Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2296591Z graph_break [] 2025-12-04T09:41:43.2296693Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2296867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2296956Z Autotune Choices Stats: 2025-12-04T09:41:43.2297788Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2297883Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2297966Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2298068Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2298546Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2299020Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2299497Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2300021Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2300630Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2301117Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2301657Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2302140Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2302614Z triton_mm_1099 0.0286 ms 96.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2303091Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2303415Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.2303504Z Autotune Choices Stats: 2025-12-04T09:41:43.2304339Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2304432Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2304518Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2304622Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2305210Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2305692Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2306162Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2306639Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2307114Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2307624Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2308095Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2308569Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2309041Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2309572Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2309902Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.2309992Z Autotune Choices Stats: 2025-12-04T09:41:43.2310817Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2310975Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2311059Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2311170Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2311647Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2312121Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2312595Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2313071Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2313551Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2314029Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2314598Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2315065Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2315539Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2316006Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2316335Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.2316520Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2316612Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2316741Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2316984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2318408Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2318536Z graph_break [] 2025-12-04T09:41:43.2318638Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2318813Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2318906Z Autotune Choices Stats: 2025-12-04T09:41:43.2319790Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2319885Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2320012Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2320115Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2320594Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2321068Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2321543Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2322021Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2322498Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2322975Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2323604Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2324095Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2324583Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2325056Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2325388Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 
seconds precompiling for 15 choices 2025-12-04T09:41:43.2325480Z Autotune Choices Stats: 2025-12-04T09:41:43.2326320Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2326411Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2326498Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2326600Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2327077Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2327554Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2328063Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2328537Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2329003Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2329469Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2329970Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2330442Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2330917Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2331389Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2331720Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.2331895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2331987Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2332121Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2332360Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2333810Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2333894Z graph_break [] 2025-12-04T09:41:43.2333998Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2334173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2334262Z Autotune Choices Stats: 
2025-12-04T09:41:43.2335117Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2335219Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2335303Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2335409Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2335902Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2336381Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2336854Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2337366Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2337885Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2338352Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2338829Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.2339335Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2339814Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2340286Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2340618Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.2340708Z Autotune Choices Stats: 2025-12-04T09:41:43.2341538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2341638Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2341723Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2341825Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2342394Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2342867Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2343344Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2343817Z 
triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2344287Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2344755Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2345231Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2345709Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2346181Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2346721Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2347051Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.2347227Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2347320Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2347452Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2347721Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2349157Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2349246Z graph_break [] 2025-12-04T09:41:43.2349352Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2349526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2349620Z Autotune Choices Stats: 2025-12-04T09:41:43.2350444Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.2350539Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2350632Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2350734Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2351218Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2351768Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2352238Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2352713Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2353187Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2353668Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2354147Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2354617Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2355081Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2355556Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2355938Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.2356031Z Autotune Choices Stats: 2025-12-04T09:41:43.2356892Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2356986Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2357071Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2357176Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2357699Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2358174Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2358653Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2359129Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2359654Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2360124Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2360594Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2361140Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2361623Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2362102Z triton_mm_1305 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2362441Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.2362613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2362708Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2362841Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2363086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2364460Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2364547Z graph_break [] 2025-12-04T09:41:43.2364649Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2364825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2364955Z Autotune Choices Stats: 2025-12-04T09:41:43.2365786Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2365881Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2365966Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2366073Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2366548Z triton_mm_1309 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2367091Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2367588Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2368071Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2368551Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2369026Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2369514Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2369988Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2370538Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2371009Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2371340Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.2371437Z Autotune Choices Stats: 2025-12-04T09:41:43.2372265Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2372361Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2372448Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2372553Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2373028Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2373498Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2373964Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2374439Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2374949Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2375419Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2375883Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2376401Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2376881Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2377368Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2377753Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.2377924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2378021Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2378151Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2378399Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2379881Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2379967Z graph_break [] 2025-12-04T09:41:43.2380074Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2380245Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2380335Z Autotune Choices Stats: 2025-12-04T09:41:43.2381186Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2381280Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2381370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2381474Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2381960Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2382435Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2382901Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2383377Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2383841Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2384357Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2384829Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2385300Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2385811Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2386284Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2386622Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.2386714Z Autotune Choices Stats: 2025-12-04T09:41:43.2387547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2387644Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2387730Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2387836Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2388315Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2388790Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2389347Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2389815Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2390286Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2390754Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2391226Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2391696Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2392167Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2392642Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2392973Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.2393187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2393282Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2393418Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2393666Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2394604Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2394729Z graph_break [] 2025-12-04T09:41:43.2394832Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2395005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2395098Z Autotune Choices Stats: 2025-12-04T09:41:43.2395936Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2396033Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2396118Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2396222Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2396704Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2397218Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2397699Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2398243Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2398719Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2399192Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2399710Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2400189Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2400811Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2401286Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2401612Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.2401706Z Autotune Choices Stats: 2025-12-04T09:41:43.2402542Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2402705Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2402793Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2402897Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.2403372Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2403846Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2404318Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2404859Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2405330Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2405794Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2406265Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2406738Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2407249Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2407842Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2408172Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.2408262Z Autotune Choices Stats: 2025-12-04T09:41:43.2409092Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2409191Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2409274Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2409387Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2409871Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2410346Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2410821Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2411296Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2411778Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2412294Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2412778Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2413254Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2413734Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2414259Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2414589Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.2414767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2414860Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2414995Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2415239Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2416179Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2416270Z graph_break [] 2025-12-04T09:41:43.2416379Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2416551Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:43.2416648Z Autotune Choices Stats: 2025-12-04T09:41:43.2417588Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2417685Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2417770Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2417873Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2418350Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2418836Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2419317Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2419804Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2420268Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2420743Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2421208Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2421719Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2422183Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2422648Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2423026Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.2423118Z Autotune Choices Stats: 2025-12-04T09:41:43.2423955Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2424053Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2424138Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2424245Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2424717Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2425191Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2425662Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2426136Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2426679Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2427199Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2427667Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2428141Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2428620Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2429092Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2429424Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.2429519Z Autotune Choices Stats: 2025-12-04T09:41:43.2430341Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2430489Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2430573Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2430683Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2431161Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2431633Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2432107Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2432626Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2433107Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2433587Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2434071Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2434549Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2435017Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2435566Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2435896Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.2436072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2436164Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2436295Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2436542Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2437488Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2437593Z graph_break [] 2025-12-04T09:41:43.2437707Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2437901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2437994Z Autotune Choices Stats: 2025-12-04T09:41:43.2438828Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2438920Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2439010Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2439113Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2439640Z 
triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2440175Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2440656Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2441141Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2441668Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2442148Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2442621Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2443094Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2443561Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2444027Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2444362Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.2444456Z Autotune Choices Stats: 2025-12-04T09:41:43.2445373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2445467Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2445552Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2445660Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2446133Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2446611Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2447079Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2447549Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2448018Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2448482Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2448952Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2449468Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2449945Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2450416Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2450742Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.2450908Z Autotune Choices Stats: 2025-12-04T09:41:43.2451755Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2451856Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2451945Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2452054Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2452539Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2453017Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.2453490Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2453958Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2454507Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2454980Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2455449Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2455929Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2456405Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2456900Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2457263Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.2457434Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2457531Z frames [('total', 1), 
('ok', 1)] 2025-12-04T09:41:43.2457662Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2457909Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2459328Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2459412Z graph_break [] 2025-12-04T09:41:43.2459520Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2459691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2459784Z Autotune Choices Stats: 2025-12-04T09:41:43.2460615Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2460752Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2460845Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2460947Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2461431Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2461905Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2462387Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2462869Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2463336Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2463880Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2464354Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2464823Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2465296Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2465769Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2466103Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.2466193Z Autotune Choices Stats: 2025-12-04T09:41:43.2467047Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2467143Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2467226Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2467338Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2467870Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2468389Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2468856Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2469320Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2469828Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2470294Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2470766Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2471241Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.2471718Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2472190Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2472521Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.2472697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2472930Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2473064Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2473308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2474682Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2474772Z graph_break [] 2025-12-04T09:41:43.2474876Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2478340Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2478447Z Autotune Choices Stats: 2025-12-04T09:41:43.2479304Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, 
"best_triton_pos": 0} 2025-12-04T09:41:43.2479402Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2479551Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2479664Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2480154Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2480700Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2481176Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2481643Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2482112Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2482629Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2483107Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2483582Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2484052Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2484529Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2484861Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.2484958Z Autotune Choices Stats: 2025-12-04T09:41:43.2485864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2485961Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2486049Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2486154Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2486639Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2487119Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2487583Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2488067Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2488533Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2489001Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2489469Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2489944Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2490497Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2490971Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2491302Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.2491522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2491621Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2491753Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2491997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2493381Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2493467Z graph_break [] 2025-12-04T09:41:43.2493574Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2493747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2493842Z Autotune Choices Stats: 2025-12-04T09:41:43.2494685Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2494781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2494869Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2495062Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2495544Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2496024Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2496507Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2496986Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2497489Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2497995Z 
triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2498471Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2498939Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2499451Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2499920Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2500430Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.2500527Z Autotune Choices Stats: 2025-12-04T09:41:43.2501384Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2501561Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2501651Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2501766Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2502334Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2502811Z triton_mm_1772 0.0277 ms 99.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2503279Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2503750Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2504222Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2504796Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2505268Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2505743Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2506217Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2506693Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2507023Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.2507204Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2507302Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2507434Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2507715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2509111Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2509257Z graph_break [] 2025-12-04T09:41:43.2509362Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2509540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2509635Z Autotune Choices Stats: 2025-12-04T09:41:43.2510469Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2510566Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2510699Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2510805Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2511285Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2511775Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2512254Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2512730Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2513206Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2513689Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2514234Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2514710Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2515179Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2515650Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2515984Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.2516077Z Autotune Choices Stats: 2025-12-04T09:41:43.2516917Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2517012Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2517104Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2517207Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2517688Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2518176Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2518693Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2519171Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2519681Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2520148Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2520661Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2521140Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2521619Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2522093Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2522423Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.2522596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2522690Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2522826Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2523073Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2524120Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2524206Z graph_break [] 2025-12-04T09:41:43.2524310Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2524485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2524582Z Autotune Choices Stats: 2025-12-04T09:41:43.2525425Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 
2025-12-04T09:41:43.2525525Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2525612Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2525722Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2526209Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2526678Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2527204Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2527680Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2528196Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2528676Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2529147Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2529659Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2530130Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2530608Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2530940Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.2531034Z Autotune Choices Stats: 2025-12-04T09:41:43.2531867Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2531967Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2532054Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2532162Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2532641Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2533196Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2533673Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2534143Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2534612Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2535087Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2535554Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2536039Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2536519Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2536995Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2537432Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.2537530Z Autotune Choices Stats: 2025-12-04T09:41:43.2538379Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2538472Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2538558Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2538714Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2539192Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.2539672Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2540150Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2540624Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2541104Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2541586Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2542069Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2542628Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2543116Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2543585Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2543924Z SingleProcess AUTOTUNE benchmarking takes 0.2040 
seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.2544102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2544198Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2544332Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2544581Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2545961Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2546049Z graph_break [] 2025-12-04T09:41:43.2546154Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2546374Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2546467Z Autotune Choices Stats: 2025-12-04T09:41:43.2547326Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2547425Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2547511Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2547619Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2548103Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2548625Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2549104Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2549579Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2550060Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2550527Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2551007Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2551551Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2552027Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2552503Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2552836Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.2552934Z Autotune Choices Stats: 
2025-12-04T09:41:43.2553773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2553874Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2553966Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2554070Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2554552Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2555028Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2555503Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2556014Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2556486Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2556957Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2557473Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.2558002Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2558480Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2558956Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2559286Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.2559460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2559610Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2559750Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2559997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2561488Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2561576Z graph_break [] 2025-12-04T09:41:43.2561685Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2561859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2561949Z Autotune Choices Stats: 2025-12-04T09:41:43.2562792Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2562897Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2562986Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2563092Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2563574Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2564056Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2564525Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2565012Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2565532Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2566006Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2566475Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2566945Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2567511Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2567993Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2568326Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.2568422Z Autotune Choices Stats: 2025-12-04T09:41:43.2569262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2569362Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2569449Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2569555Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2570035Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2570588Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2571067Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2571534Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2572005Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2572474Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2572948Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2573422Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2573896Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2574374Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2574743Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.2574920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2575022Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2575154Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2575399Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2576783Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2576925Z graph_break [] 2025-12-04T09:41:43.2577031Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2577204Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2577310Z Autotune Choices Stats: 2025-12-04T09:41:43.2578183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2578281Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2578369Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2578474Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2578953Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2579444Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2580015Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2580495Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2580977Z triton_mm_2011 0.0276 ms 
100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2581453Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2581922Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2582403Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2582876Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2583345Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2583680Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.2583771Z Autotune Choices Stats: 2025-12-04T09:41:43.2584671Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2584765Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2584852Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2584957Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2585440Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2585958Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2586427Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2586904Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2587372Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2587891Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2588359Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2588836Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2589389Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2589863Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.2590197Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.2590374Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2590468Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2590602Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2590849Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2592225Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2592308Z graph_break [] 2025-12-04T09:41:43.2592412Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2592590Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2592684Z Autotune Choices Stats: 2025-12-04T09:41:43.2593516Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2593652Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2593739Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2593851Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2594332Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2594803Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2595338Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2595806Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2596282Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2596754Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2597254Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2597757Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2598235Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2598790Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2599124Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.2599219Z Autotune Choices Stats: 2025-12-04T09:41:43.2600100Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2600200Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2600422Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2600529Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2601017Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2601488Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2601961Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2602425Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2602892Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2603437Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2603900Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2604374Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2604898Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2605371Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2605701Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.2605923Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.2606028Z Traceback (most recent call last): 2025-12-04T09:41:43.2606440Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.2606623Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.2606968Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.2607148Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.2607311Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.2607395Z Searched string: 2025-12-04T09:41:43.2607525Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.2607534Z 2025-12-04T09:41:43.2607649Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.2607654Z 2025-12-04T09:41:43.2607885Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.2608013Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 
2025-12-04T09:41:43.2608018Z 2025-12-04T09:41:43.2608110Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.2608198Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.2608292Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2608380Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.2608384Z 2025-12-04T09:41:43.2608475Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.2608562Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.2608651Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2608740Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.2608747Z 2025-12-04T09:41:43.2608751Z 2025-12-04T09:41:43.2608908Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.2608912Z 2025-12-04T09:41:43.2608916Z 2025-12-04T09:41:43.2609034Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.2609154Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.2609265Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.2609351Z idx_m = rm[:, None] 2025-12-04T09:41:43.2609433Z idx_n = rn[None, :] 2025-12-04T09:41:43.2609525Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.2609529Z 2025-12-04T09:41:43.2609632Z # inductor generates a suffix 2025-12-04T09:41:43.2609720Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.2609933Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.2610021Z ''', device_str='cuda') 2025-12-04T09:41:43.2610025Z 2025-12-04T09:41:43.2610028Z 2025-12-04T09:41:43.2610125Z async_compile.wait(globals()) 2025-12-04T09:41:43.2610251Z del async_compile 2025-12-04T09:41:43.2610255Z 2025-12-04T09:41:43.2610339Z class Runner: 2025-12-04T09:41:43.2610437Z def __init__(self, partitions): 2025-12-04T09:41:43.2610541Z self.partitions = partitions 2025-12-04T09:41:43.2610549Z 2025-12-04T09:41:43.2610657Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.2610744Z new_callables = [] 2025-12-04T09:41:43.2610864Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.2610968Z 
new_callables.append(fn(c)) 2025-12-04T09:41:43.2611072Z self.partitions = new_callables 2025-12-04T09:41:43.2611076Z 2025-12-04T09:41:43.2611162Z def call(self, args): 2025-12-04T09:41:43.2611249Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.2611377Z args.clear() 2025-12-04T09:41:43.2611503Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.2611626Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.2611738Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.2611833Z torch.cuda.set_device(0) 2025-12-04T09:41:43.2611999Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.2612222Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.2612318Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.2612509Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.2612590Z del arg0_1 2025-12-04T09:41:43.2612749Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.2613005Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.2613105Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.2613322Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.2613410Z del arg1_1 2025-12-04T09:41:43.2613486Z del buf0 2025-12-04T09:41:43.2613572Z return (buf1, ) 2025-12-04T09:41:43.2613576Z 2025-12-04T09:41:43.2613675Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.2613758Z call = runner.call 2025-12-04T09:41:43.2613994Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.2613998Z 2025-12-04T09:41:43.2614002Z 2025-12-04T09:41:43.2614140Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.2614268Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.2614416Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.2614615Z arg0_1 = rand_strided((256, 256), 
(256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.2614820Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.2614918Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.2615080Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.2615087Z 2025-12-04T09:41:43.2615091Z 2025-12-04T09:41:43.2615180Z if __name__ == "__main__": 2025-12-04T09:41:43.2615378Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.2615540Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.2615621Z From CHECK: .to( 2025-12-04T09:41:43.2615626Z 2025-12-04T09:41:43.2615629Z 2025-12-04T09:41:43.2615801Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.2616356Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.2616361Z 2025-12-04T09:41:43.2616580Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.2616756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2616850Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2617021Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2617272Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2618706Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.2618793Z graph_break [] 2025-12-04T09:41:43.2618894Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2619198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2619292Z Autotune Choices Stats: 2025-12-04T09:41:43.2620129Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2620231Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2620317Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2620422Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2620914Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2621375Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2621842Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2622306Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2622843Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2623318Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2623775Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2624234Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2624692Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2625164Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2625497Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.2625588Z Autotune Choices Stats: 2025-12-04T09:41:43.2626416Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2626585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2626670Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2626775Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2627253Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2627762Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2628219Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2628721Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2629178Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2629637Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2630096Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2630560Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2631030Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2631495Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2631906Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.2632082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2632174Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2632308Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2632552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2633496Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2633585Z graph_break [] 2025-12-04T09:41:43.2633687Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2633862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2633955Z Autotune Choices Stats: 2025-12-04T09:41:43.2634785Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2634882Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2634967Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2635075Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2635549Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2636064Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2636538Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2637014Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2637497Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2638017Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2638482Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2638954Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2639424Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2639982Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2640320Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.2640416Z Autotune Choices Stats: 2025-12-04T09:41:43.2641354Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2641448Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2641535Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2641637Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2642113Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2642580Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2643052Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2643519Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2643978Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2644444Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2644907Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2645420Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2645892Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2646363Z 
triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2646699Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.2646849Z Autotune Choices Stats: 2025-12-04T09:41:43.2647736Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.2647831Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2647916Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2648036Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2648509Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2648972Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2649431Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2649899Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2650450Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2650920Z triton_mm_861 0.0276 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2651392Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2651861Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2652328Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2652792Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2653118Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.2653295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2653388Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2653521Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2653764Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2654706Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2654835Z graph_break [] 2025-12-04T09:41:43.2654942Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2655114Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.2655210Z Autotune Choices Stats: 2025-12-04T09:41:43.2656038Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2656175Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2656260Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2656363Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2656845Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2657316Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2657843Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2658304Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2658781Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2659250Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2659825Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2660302Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2660762Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2661226Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2661559Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.2661650Z Autotune Choices Stats: 2025-12-04T09:41:43.2662477Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.2662569Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2662657Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2662760Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2663227Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2663702Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2664211Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2664675Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2665134Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2665638Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2666099Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2666574Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2667046Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2667519Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2667854Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.2667945Z Autotune Choices Stats: 2025-12-04T09:41:43.2668774Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2668951Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2669039Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2669152Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2669626Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2670096Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2670578Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2671042Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2671508Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2671967Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2672429Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2672892Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2673396Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2673865Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2674193Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.2674367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2674503Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2674633Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2674882Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2676268Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2676353Z graph_break [] 2025-12-04T09:41:43.2676455Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2676629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2676722Z Autotune Choices Stats: 2025-12-04T09:41:43.2677616Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2677713Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.2677797Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2677900Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2678464Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2678926Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2679392Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2679902Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2680372Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2680845Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2681313Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2681786Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2682257Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2682770Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2683102Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.2683193Z Autotune Choices Stats: 2025-12-04T09:41:43.2684022Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2684156Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2684245Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2684347Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2684823Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2685298Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2685776Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2686240Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2686712Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2687201Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2687779Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2688254Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2688734Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2689209Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2689540Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.2689716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2689809Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2689941Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2690186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2691132Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2691216Z graph_break [] 2025-12-04T09:41:43.2691319Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2691539Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2691629Z Autotune Choices Stats: 2025-12-04T09:41:43.2692474Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.2692566Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2692651Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2692760Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2693241Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2693756Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2694227Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2694695Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2695164Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2695631Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2696103Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2696680Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2697156Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2697630Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2697957Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.2698055Z Autotune Choices Stats: 2025-12-04T09:41:43.2698883Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2698981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2699070Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2699173Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2699650Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2700122Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2700742Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2701282Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2701748Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2702227Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2702701Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2703235Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2703709Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2704190Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2704517Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.2704607Z Autotune Choices Stats: 2025-12-04T09:41:43.2705445Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2705542Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2705628Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2705737Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2706328Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2706816Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2707292Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2707821Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2708301Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2708775Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2709245Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2709717Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2710194Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2710705Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2711044Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.2711218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2711312Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2711448Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2711695Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2712692Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2712781Z graph_break [] 2025-12-04T09:41:43.2712887Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2713073Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2713171Z Autotune Choices Stats: 2025-12-04T09:41:43.2714014Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2714112Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2714199Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2714315Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.2714794Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2715278Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2715835Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2716316Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2716797Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2717308Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2717810Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2718295Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2718762Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2719239Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2719630Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.2719770Z Autotune Choices Stats: 2025-12-04T09:41:43.2720605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2720700Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2720788Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2720897Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2721373Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2721894Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2722370Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2722848Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2723327Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2723795Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2724261Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2724741Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2725290Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2725773Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2726099Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.2726196Z Autotune Choices Stats: 2025-12-04T09:41:43.2727041Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2727139Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2727228Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2727341Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2727820Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2728297Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2728771Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2729256Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2729790Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2730275Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2730763Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2731299Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2731777Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2732245Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2732578Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.2732752Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.2732849Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2732985Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2733235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2734690Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2734780Z graph_break [] 2025-12-04T09:41:43.2734886Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2735063Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2735153Z Autotune Choices Stats: 2025-12-04T09:41:43.2735981Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2736084Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2736171Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2736280Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2736762Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2737237Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2741059Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2741572Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2742119Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2742600Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2743079Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2743558Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2744081Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2744558Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2744889Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.2744983Z Autotune Choices Stats: 2025-12-04T09:41:43.2745826Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2745926Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2746010Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2746115Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2746601Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2747154Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2747646Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2748143Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2748614Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2749079Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2749548Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2750021Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2750492Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2750967Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2751341Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.2751517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2751618Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2751754Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2752002Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2753386Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2753512Z graph_break [] 2025-12-04T09:41:43.2753621Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2753794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2753885Z Autotune Choices Stats: 2025-12-04T09:41:43.2754741Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2754833Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2754920Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2755023Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2755510Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2755986Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2756537Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2757009Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2757475Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2757943Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2758416Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2758889Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.2759362Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2759900Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2760241Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.2760332Z Autotune Choices Stats: 2025-12-04T09:41:43.2761219Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2761314Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2761398Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2761503Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2761979Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2762452Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2762966Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2763443Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2763909Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2764372Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2764843Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2765312Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2765861Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2766335Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2766660Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.2766835Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2766930Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2767060Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2767304Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2768738Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2768823Z graph_break [] 2025-12-04T09:41:43.2768925Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2769096Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2769190Z Autotune Choices Stats: 2025-12-04T09:41:43.2770033Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.2770286Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2770370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2770474Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2770953Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2771425Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2771897Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2772412Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2772891Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2773365Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2773838Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2774308Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2774772Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2775319Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2775644Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.2775735Z Autotune Choices Stats: 2025-12-04T09:41:43.2776573Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2776667Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2776754Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2776860Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2777354Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2777862Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2778333Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2778804Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2779273Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2779788Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2780255Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2780727Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2781198Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2781705Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2782035Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.2782209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2782303Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2782434Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2782678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2784084Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2784169Z graph_break [] 2025-12-04T09:41:43.2784271Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2784445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2784614Z Autotune Choices Stats: 2025-12-04T09:41:43.2785449Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2785541Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2785628Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2785734Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2786214Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2786699Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2787174Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2787651Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2788131Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2788612Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2789141Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2789612Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2790085Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2790551Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2790930Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.2791025Z 
Autotune Choices Stats: 2025-12-04T09:41:43.2791854Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2791947Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2792032Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2792134Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2792619Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2793087Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2793559Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2794108Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2794577Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2795041Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2795507Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2795981Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2796455Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2796926Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2797255Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.2797428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2797541Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2797737Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2797990Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2799362Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2799443Z graph_break [] 2025-12-04T09:41:43.2799609Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2799826Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2799920Z Autotune Choices Stats: 2025-12-04T09:41:43.2800929Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2801029Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2801117Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2801219Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2801710Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2802184Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2802655Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2803127Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2803747Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2804215Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2804687Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2805165Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2805636Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2806113Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2806444Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.2806536Z Autotune Choices Stats: 2025-12-04T09:41:43.2807416Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2807579Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2807663Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2807768Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2808249Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2808725Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2809196Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2809718Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2810189Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2810657Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2811125Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2811594Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2812070Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2812541Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2812948Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.2813125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2813219Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2813352Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2813597Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2814538Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2814625Z graph_break [] 2025-12-04T09:41:43.2814726Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2814900Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2814993Z Autotune Choices Stats: 2025-12-04T09:41:43.2815822Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2815916Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2815999Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2816105Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2816583Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2817093Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2817565Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2818030Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2818504Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2819014Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2819491Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2819972Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2820447Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2820919Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2821252Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.2821347Z Autotune Choices Stats: 2025-12-04T09:41:43.2822259Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2822353Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2822443Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2822546Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2823020Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.2823499Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2823969Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2824441Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2824904Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2825371Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2825838Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2826353Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2826832Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2827303Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2827683Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.2827815Z Autotune Choices Stats: 2025-12-04T09:41:43.2828672Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2828765Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2828849Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2828964Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2829447Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2829925Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2830400Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2830879Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2831441Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2831918Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2832398Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2832876Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2833365Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2833847Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2834173Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.2834347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2834439Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2834572Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2834815Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2835757Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2835886Z graph_break [] 2025-12-04T09:41:43.2835992Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2836163Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2836255Z Autotune Choices Stats: 2025-12-04T09:41:43.2837088Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2837221Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2837306Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2837420Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2837955Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2838439Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2838920Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2839402Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2839932Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2840408Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2840980Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2841450Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2841916Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2842389Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2842718Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.2842809Z Autotune Choices Stats: 2025-12-04T09:41:43.2843647Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2843739Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2843825Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2843929Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2844406Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2844921Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2845389Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2845858Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.2846322Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2846832Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2847298Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2847771Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2848244Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2848717Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2849052Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.2849144Z Autotune Choices Stats: 2025-12-04T09:41:43.2850599Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2850739Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2850853Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2851003Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:43.2851591Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2852066Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2852540Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2853022Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2853500Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2853975Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2854462Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2854982Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2855452Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2855919Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2856248Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.2856467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2856559Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2856688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2856936Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2857924Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2858011Z graph_break [] 2025-12-04T09:41:43.2858111Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2858282Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2858374Z Autotune Choices Stats: 2025-12-04T09:41:43.2859205Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2859303Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2859387Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2859489Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2860050Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2860524Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2860997Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2861475Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2861957Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2862443Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2862908Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2863380Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2863846Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2864353Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2864687Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:43.2864777Z Autotune Choices Stats: 2025-12-04T09:41:43.2865604Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2865736Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2865822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2865923Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2866397Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2866877Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2867367Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2867860Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2868329Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2868799Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2869340Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2869813Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2870286Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2870760Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2871091Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.2871181Z Autotune Choices Stats: 2025-12-04T09:41:43.2872016Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2872111Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2872194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2872305Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2872786Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2873260Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2873775Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.2874240Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2874707Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2875237Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2875710Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2876186Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2876663Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2877140Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2877467Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.2877642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2877736Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2877864Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2878112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.2879632Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2879720Z graph_break [] 2025-12-04T09:41:43.2879821Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2879995Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2880092Z Autotune Choices Stats: 2025-12-04T09:41:43.2880927Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2881027Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2881111Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2881214Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2881693Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2882165Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2882652Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2883181Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2883657Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2884125Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2884633Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2885102Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2885578Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2886051Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2886375Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.2886464Z Autotune Choices Stats: 2025-12-04T09:41:43.2887356Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2887450Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2887538Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2887640Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2888192Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2888663Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2889127Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2889598Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2890068Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2890537Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2891003Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2891477Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2891949Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.2892463Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2892794Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.2892965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2893059Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2893196Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2893481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2894867Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2894951Z graph_break [] 2025-12-04T09:41:43.2895053Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2895227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2895316Z Autotune Choices Stats: 2025-12-04T09:41:43.2896153Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.2896247Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2896331Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2896443Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.2896923Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2897507Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2898002Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2898469Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2898942Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2899418Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2899893Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2900836Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2901315Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2901788Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2902201Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.2902300Z Autotune Choices Stats: 2025-12-04T09:41:43.2903126Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.2903222Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2903307Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2903478Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2903957Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2904424Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2904896Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2905368Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2905833Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2906302Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2906770Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2907354Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2907828Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2908300Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2908630Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.2908806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2908902Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2909030Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2909283Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2910655Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2910740Z graph_break [] 2025-12-04T09:41:43.2910846Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.2911090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2911182Z Autotune Choices Stats: 2025-12-04T09:41:43.2912027Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2912119Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2912207Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2912311Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2912794Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2913317Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2913796Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2914290Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2914769Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2915240Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2915712Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2916353Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2916821Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2917335Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2917669Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.2917763Z Autotune Choices Stats: 2025-12-04T09:41:43.2918609Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2918703Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2918796Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2918901Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2919380Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2919902Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2920373Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2920886Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2921369Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2921835Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2922303Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2922823Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2923309Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2923785Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2924117Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.2924293Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2924387Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2924520Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.2924764Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2926231Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2926324Z graph_break [] 2025-12-04T09:41:43.2926426Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2926599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2926688Z Autotune Choices Stats: 2025-12-04T09:41:43.2927563Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2927664Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2927750Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2927853Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2928337Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2928816Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2929297Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.2929779Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2930300Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2930794Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2931264Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2931741Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2932243Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2932714Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2933049Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.2933144Z Autotune Choices Stats: 2025-12-04T09:41:43.2933984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:43.2934078Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2934167Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2934272Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2934747Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2935307Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2935782Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2936263Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2936727Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2937198Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2937667Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2938143Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2938614Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2939089Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2939454Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.2939627Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2939726Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2939857Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2940102Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2941039Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2941168Z graph_break [] 2025-12-04T09:41:43.2941270Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2941449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2941542Z Autotune Choices Stats: 2025-12-04T09:41:43.2942400Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.2942491Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2942583Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2942690Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2943172Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2943647Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2944116Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2944690Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2945161Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2945637Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2946108Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2946581Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2947081Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2947575Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2947900Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.2947999Z Autotune Choices Stats: 2025-12-04T09:41:43.2948837Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.2948974Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2949064Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2949166Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2949645Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2950123Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2950635Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2951103Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2951574Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2952039Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.2952511Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2952996Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2953471Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2954027Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2954359Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.2954450Z Autotune Choices Stats: 2025-12-04T09:41:43.2955281Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2955379Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2955470Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2955581Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.2956062Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2956546Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.2957026Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2957543Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2958021Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2958544Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2959021Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2959549Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2960070Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2960540Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2960877Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.2961052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2961149Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.2961282Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2961528Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2962912Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2962999Z graph_break [] 2025-12-04T09:41:43.2963107Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2963357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2963449Z Autotune Choices Stats: 2025-12-04T09:41:43.2964310Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.2964404Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2964488Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2964595Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2965078Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2965561Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2966032Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2966504Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2967013Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2967544Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2968026Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2968492Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2968979Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2969496Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2969831Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.2969926Z Autotune Choices Stats: 2025-12-04T09:41:43.2970778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2970878Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2970963Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2971066Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2971551Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2972025Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2972578Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2973053Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2973530Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2974002Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2974469Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2974952Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2975424Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2975899Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2976230Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.2976401Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2976541Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2976672Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2976944Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2978346Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2978476Z graph_break [] 2025-12-04T09:41:43.2978578Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2978751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2978844Z Autotune Choices Stats: 2025-12-04T09:41:43.2979685Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.2979777Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2979865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2979969Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2980452Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2980931Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2981402Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2981981Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2982459Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.2982932Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2983400Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2983870Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2984341Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2984806Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2985135Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.2985226Z Autotune Choices Stats: 2025-12-04T09:41:43.2986060Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2986192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2986278Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2986386Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.2986870Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2987392Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2987856Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2988365Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2988841Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.2989307Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2989780Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2990256Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2990742Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2991293Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2991626Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.2991805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.2991901Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.2992032Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.2992277Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.2993659Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.2993749Z graph_break [] 2025-12-04T09:41:43.2993856Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.2994033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.2994123Z Autotune Choices Stats: 2025-12-04T09:41:43.2994955Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.2995053Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.2995138Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.2995286Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.2995769Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.2996255Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2996734Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.2997262Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2997789Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2998265Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.2998736Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.2999209Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.2999772Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3000385Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3000722Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.3000938Z Autotune Choices Stats: 2025-12-04T09:41:43.3001779Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3001870Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3001960Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3002064Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3002548Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3003023Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3003492Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3003963Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3004430Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3004904Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3005425Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3005907Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3006378Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3006866Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3007283Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.3007459Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.3007556Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3007688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3007937Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3009310Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3009395Z graph_break [] 2025-12-04T09:41:43.3009504Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3009680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3009770Z Autotune Choices Stats: 2025-12-04T09:41:43.3010683Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3010778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3010865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3010970Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3011448Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3011930Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.3012399Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3012880Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3013344Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3013815Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3014291Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3014809Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3015283Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3015762Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3016097Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.3016255Z Autotune Choices Stats: 2025-12-04T09:41:43.3017130Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3017241Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3017340Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3017448Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3017923Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3018396Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3018880Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3019349Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3019896Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3023961Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3024461Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3024941Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3025416Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3025894Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3026225Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.3026404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3026497Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3026630Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3026875Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3027810Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3027970Z graph_break [] 2025-12-04T09:41:43.3028072Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3028244Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3028337Z Autotune Choices Stats: 2025-12-04T09:41:43.3029163Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3029300Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3029385Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3029491Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3029978Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3030457Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3030938Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3031411Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3031898Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3032464Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3032938Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3033409Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3033880Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3034351Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3034682Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.3034777Z Autotune Choices Stats: 2025-12-04T09:41:43.3035603Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3035696Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3035786Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3035891Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3036366Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3036902Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3037397Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3037871Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3038334Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.3038841Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3039310Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3039860Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3040344Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3040814Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3041147Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.3041241Z Autotune Choices Stats: 2025-12-04T09:41:43.3042151Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3042246Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3042330Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3042440Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3042917Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3043391Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3043867Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3044341Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3044824Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3045299Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3045779Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3046312Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3046798Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3047325Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3047650Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:43.3047866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3047958Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3048089Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3048341Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3049721Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3049806Z graph_break [] 2025-12-04T09:41:43.3049908Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3050082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3050173Z Autotune Choices Stats: 2025-12-04T09:41:43.3051002Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3051097Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3051260Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3051365Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3051843Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3052315Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3052787Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3053261Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3053735Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3054206Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3054677Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3055148Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3055695Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3056170Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3056499Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.3056589Z Autotune Choices Stats: 
2025-12-04T09:41:43.3057423Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3057555Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3057646Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3057748Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3058234Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3058710Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3059182Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3059657Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3060123Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3060666Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3061135Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.3061606Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3062083Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3062555Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3062887Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.3063106Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.3063209Z Traceback (most recent call last): 2025-12-04T09:41:43.3063627Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.3063809Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.3064154Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.3064336Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.3064539Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.3064626Z Searched string: 2025-12-04T09:41:43.3064756Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.3064762Z 2025-12-04T09:41:43.3064885Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.3064889Z 2025-12-04T09:41:43.3065021Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.3065144Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.3065148Z 2025-12-04T09:41:43.3065241Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:43.3065328Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.3065418Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3065512Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.3065560Z 2025-12-04T09:41:43.3065645Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.3065733Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.3065824Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3065914Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.3065919Z 2025-12-04T09:41:43.3065922Z 2025-12-04T09:41:43.3066081Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.3066086Z 2025-12-04T09:41:43.3066089Z 2025-12-04T09:41:43.3066211Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.3066324Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.3066438Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.3066521Z idx_m = rm[:, None] 2025-12-04T09:41:43.3066605Z idx_n = rn[None, :] 2025-12-04T09:41:43.3066703Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.3066707Z 2025-12-04T09:41:43.3066805Z # inductor generates a suffix 2025-12-04T09:41:43.3066924Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3067163Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.3067249Z ''', device_str='cuda') 2025-12-04T09:41:43.3067253Z 2025-12-04T09:41:43.3067259Z 2025-12-04T09:41:43.3067359Z async_compile.wait(globals()) 2025-12-04T09:41:43.3067439Z del async_compile 2025-12-04T09:41:43.3067443Z 2025-12-04T09:41:43.3067520Z class Runner: 2025-12-04T09:41:43.3067622Z def __init__(self, partitions): 2025-12-04T09:41:43.3067887Z self.partitions = partitions 2025-12-04T09:41:43.3067892Z 2025-12-04T09:41:43.3068007Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.3068095Z new_callables = [] 2025-12-04T09:41:43.3068209Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.3068314Z new_callables.append(fn(c)) 2025-12-04T09:41:43.3068414Z self.partitions = new_callables 
2025-12-04T09:41:43.3068418Z 2025-12-04T09:41:43.3068506Z def call(self, args): 2025-12-04T09:41:43.3068599Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.3068678Z args.clear() 2025-12-04T09:41:43.3068802Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.3068929Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.3069034Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.3069131Z torch.cuda.set_device(0) 2025-12-04T09:41:43.3069295Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.3069514Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.3069612Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.3069798Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.3069879Z del arg0_1 2025-12-04T09:41:43.3070040Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.3070297Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.3070393Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.3070610Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.3070733Z del arg1_1 2025-12-04T09:41:43.3070813Z del buf0 2025-12-04T09:41:43.3070894Z return (buf1, ) 2025-12-04T09:41:43.3070898Z 2025-12-04T09:41:43.3070996Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.3071084Z call = runner.call 2025-12-04T09:41:43.3071240Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.3071245Z 2025-12-04T09:41:43.3071248Z 2025-12-04T09:41:43.3071387Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.3071515Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.3071660Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.3071861Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.3072103Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.3072200Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.3072368Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.3072373Z 2025-12-04T09:41:43.3072377Z 2025-12-04T09:41:43.3072463Z if __name__ == "__main__": 2025-12-04T09:41:43.3072664Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.3072823Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.3072905Z From CHECK: .to( 2025-12-04T09:41:43.3072909Z 2025-12-04T09:41:43.3072913Z 2025-12-04T09:41:43.3073089Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.3073641Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.3073648Z 2025-12-04T09:41:43.3073865Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.3074038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3074133Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3074265Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3074510Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3075963Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3076047Z graph_break [] 2025-12-04T09:41:43.3076153Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3076326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3076416Z Autotune Choices Stats: 2025-12-04T09:41:43.3077307Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3077402Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3077486Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3077592Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3078072Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3078537Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3079005Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3079566Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3080027Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3080494Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3080994Z 
triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3081448Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3081909Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3082372Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3082702Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.3082793Z Autotune Choices Stats: 2025-12-04T09:41:43.3083618Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3083716Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3083801Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3083903Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3084485Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3084944Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3085402Z triton_mm_45 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3085861Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3086319Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3086778Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3087232Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3087697Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3088162Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3088669Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3089003Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.3089176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3089271Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3089401Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.3089646Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3090636Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3090720Z graph_break [] 2025-12-04T09:41:43.3090824Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3090999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3091088Z Autotune Choices Stats: 2025-12-04T09:41:43.3091915Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3092007Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3092096Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3092199Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3092674Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3093146Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3093694Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3094173Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3094647Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3095131Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3095599Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3096067Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3096537Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3096999Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3097346Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.3097494Z Autotune Choices Stats: 2025-12-04T09:41:43.3098347Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3098441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3098525Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.3098629Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3099097Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3099603Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3100077Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3100714Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3101179Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3101635Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3102099Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3102567Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3103161Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3103634Z triton_mm_833 0.0338 ms 
81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3103959Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.3104054Z Autotune Choices Stats: 2025-12-04T09:41:43.3104877Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.3104973Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3105062Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3105172Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3105648Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3106109Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3106572Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3107096Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3107616Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3108105Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3108567Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3109090Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3109558Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3110021Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3110352Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.3110524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3110619Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3110751Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3110996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3111935Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3112019Z graph_break [] 2025-12-04T09:41:43.3112201Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3112375Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:43.3112465Z Autotune Choices Stats: 2025-12-04T09:41:43.3113302Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3113396Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3113484Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3113586Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3114062Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3114530Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3114990Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3115450Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3115918Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3116427Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3116900Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3117401Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3117886Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3118388Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3118721Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.3118811Z Autotune Choices Stats: 2025-12-04T09:41:43.3119697Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.3119792Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3119877Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3119981Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3120453Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3120921Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3121511Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3121973Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3122436Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3122898Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3123359Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3123832Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3124298Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3124768Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3125098Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.3125190Z Autotune Choices Stats: 2025-12-04T09:41:43.3126066Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3126157Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3126244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3126353Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3126833Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3127307Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3127823Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3128292Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3128752Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3129213Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3129674Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3130136Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3130671Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3131141Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3131469Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.3131640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3131737Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3131865Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3132110Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3133502Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3133585Z graph_break [] 2025-12-04T09:41:43.3133689Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3133860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3133952Z Autotune Choices Stats: 2025-12-04T09:41:43.3134788Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3134955Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.3135041Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3135149Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3135628Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3136092Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3136552Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3137061Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3137563Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3138054Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3138522Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3138992Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3139459Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3139997Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3140333Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.3140422Z Autotune Choices Stats: 2025-12-04T09:41:43.3141249Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3141344Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3141428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3141532Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3142007Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3142480Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3142956Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3143417Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3143889Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3144391Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3144864Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3145343Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3145815Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3146344Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3146676Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.3146856Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3146952Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3147087Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3147378Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3148315Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3148407Z graph_break [] 2025-12-04T09:41:43.3148511Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3148686Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3148780Z Autotune Choices Stats: 2025-12-04T09:41:43.3149699Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.3149796Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3149888Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3149991Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3150475Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3150951Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3151428Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3151898Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3152363Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3152836Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3153315Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3153836Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3154308Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3154782Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3155185Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.3155278Z Autotune Choices Stats: 2025-12-04T09:41:43.3156114Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3156213Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3156302Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3156406Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3156878Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3157352Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3157830Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3158298Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3158844Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3159310Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3159840Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3160315Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3160797Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3161269Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3161598Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.3161699Z Autotune Choices Stats: 2025-12-04T09:41:43.3162536Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3162676Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3162764Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3162875Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3163364Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3163840Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3164311Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3164816Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3165294Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3165764Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3166229Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3166712Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3167187Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3167709Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3168121Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.3168300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3168394Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3168526Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3168771Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3169712Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3169797Z graph_break [] 2025-12-04T09:41:43.3169904Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3170079Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3170172Z Autotune Choices Stats: 2025-12-04T09:41:43.3170998Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3171089Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3171179Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3171281Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.3171761Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3172277Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3172752Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3173231Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3173706Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3174237Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3174724Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3175205Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3175680Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3176160Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3176494Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.3176589Z Autotune Choices Stats: 2025-12-04T09:41:43.3177527Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3177636Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3177731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3177843Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3178317Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3178810Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3179285Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3179759Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3180232Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3180696Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3181169Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3181690Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3182167Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3182637Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3182961Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.3183098Z Autotune Choices Stats: 2025-12-04T09:41:43.3183926Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3184027Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3184116Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3184228Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3184707Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3185179Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3185660Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3186139Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3186694Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3187186Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3187670Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3188144Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3188619Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3189093Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3189423Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.3189597Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.3189693Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3189826Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3190075Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3191462Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3191610Z graph_break [] 2025-12-04T09:41:43.3191719Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3191891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3191985Z Autotune Choices Stats: 2025-12-04T09:41:43.3192816Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3192952Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3193041Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3193144Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3193629Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3194105Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3194577Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3195061Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3195541Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3196101Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3196583Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3197115Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3197595Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3198070Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3198410Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.3198503Z Autotune Choices Stats: 2025-12-04T09:41:43.3199355Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3199450Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3199590Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3199697Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3200221Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3200832Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3201305Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3201771Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3202312Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3202784Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3203260Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3203733Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3204207Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3204683Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3205015Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.3205195Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3205397Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3205536Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3205779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3207209Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3207300Z graph_break [] 2025-12-04T09:41:43.3207403Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3207579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3207670Z Autotune Choices Stats: 2025-12-04T09:41:43.3208533Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3208630Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3208716Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3208822Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3209308Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3209839Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3210318Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3210788Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3211260Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3211838Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3212315Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3212790Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.3213263Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3213740Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3214071Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.3214169Z Autotune Choices Stats: 2025-12-04T09:41:43.3215084Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3215177Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3215266Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3215370Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3215853Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3216328Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3216798Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3217278Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3217745Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3218213Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3218691Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3219215Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3219704Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3220176Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3220511Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.3220722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3220819Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3220950Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3221199Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3222594Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3222677Z graph_break [] 2025-12-04T09:41:43.3222784Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3222956Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3223050Z Autotune Choices Stats: 2025-12-04T09:41:43.3223904Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.3224000Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3224092Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3224304Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3224785Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3225265Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3225749Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3226230Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3226706Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3227231Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3227710Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3228181Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3228692Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3229162Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3229493Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.3229586Z Autotune Choices Stats: 2025-12-04T09:41:43.3230420Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3230555Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3230644Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3230749Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3231234Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3231705Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3232184Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3232658Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3233130Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3233676Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3234144Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3234619Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3235102Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3235579Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3235915Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.3236092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3236186Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3236315Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3236562Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3237977Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3238108Z graph_break [] 2025-12-04T09:41:43.3238212Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3238389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3238483Z Autotune Choices Stats: 2025-12-04T09:41:43.3239310Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3239452Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3239599Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3239704Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3240181Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3240662Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3241139Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3241616Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3242096Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3242576Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3243177Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3243655Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3244134Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3244608Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3244943Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.3245034Z 
Autotune Choices Stats: 2025-12-04T09:41:43.3245873Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3245967Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3246057Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3246158Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3246636Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3247111Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3247621Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3248100Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3248577Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3249046Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3249559Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3250038Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3250515Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3250985Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3251318Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.3251490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3251582Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3251720Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3251962Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3253424Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3253509Z graph_break [] 2025-12-04T09:41:43.3253616Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3253795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3253884Z Autotune Choices Stats: 2025-12-04T09:41:43.3254736Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3254831Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3254916Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3255024Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3255519Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3255992Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3256467Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3257012Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3257506Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3257971Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3258450Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3258961Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3259438Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3259913Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3260244Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.3260340Z Autotune Choices Stats: 2025-12-04T09:41:43.3261175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3261276Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3261362Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3261468Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3262052Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3262531Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3263006Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3263475Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3263943Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3264419Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3264884Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3267647Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3268155Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3268683Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3269016Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.3269192Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3269289Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3269421Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3269668Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3270675Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3270762Z graph_break [] 2025-12-04T09:41:43.3270866Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3271048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3271140Z Autotune Choices Stats: 2025-12-04T09:41:43.3271969Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3272067Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3272159Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3272269Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3272747Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3273257Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3273729Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3274196Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3274682Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3275154Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3275635Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3276112Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3276587Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3277150Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3277482Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.3277617Z Autotune Choices Stats: 2025-12-04T09:41:43.3278445Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3278542Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3278626Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3278729Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3279212Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.3279835Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3280310Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3280780Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3281243Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3281714Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3282192Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3282712Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3283189Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3283664Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3283995Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.3284091Z Autotune Choices Stats: 2025-12-04T09:41:43.3288629Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3288745Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3288843Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3288958Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3289449Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3289927Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3290485Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3291008Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3291486Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3291961Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3292442Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3292962Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3293451Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3293937Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3294267Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.3294445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3294544Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3294681Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3294931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3295918Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3296009Z graph_break [] 2025-12-04T09:41:43.3296116Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3296298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3296392Z Autotune Choices Stats: 2025-12-04T09:41:43.3297267Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3297370Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3297459Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3297574Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3298049Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3298534Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3299016Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3299547Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3300026Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3300750Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3301223Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3301690Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3302320Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3302871Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3303260Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.3303363Z Autotune Choices Stats: 2025-12-04T09:41:43.3304358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3304455Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3304550Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3304664Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3305228Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3305788Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3306405Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3306961Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.3307512Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3308073Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3308625Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3309187Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3309744Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3310358Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3310758Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.3310905Z Autotune Choices Stats: 2025-12-04T09:41:43.3311903Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3312000Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3312089Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3312211Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:43.3312771Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3313378Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3313938Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3314504Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3315070Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3315633Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3316211Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3316779Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3317422Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3317975Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3318361Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.3318565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3318664Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3318807Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3319090Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3320158Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3320248Z graph_break [] 2025-12-04T09:41:43.3320352Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3320528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3320621Z Autotune Choices Stats: 2025-12-04T09:41:43.3321515Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3321657Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3321746Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3321854Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3322341Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3322816Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3323292Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3323816Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3324304Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3324779Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3325244Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3325726Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3326193Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3326705Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3327078Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:43.3327188Z Autotune Choices Stats: 2025-12-04T09:41:43.3328017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3328118Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3328212Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3328320Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3328799Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3329272Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3329739Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3330209Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3330727Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3331238Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3331714Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3332191Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3332664Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3333179Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3333517Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.3333611Z Autotune Choices Stats: 2025-12-04T09:41:43.3334451Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3334546Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3334633Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3334752Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3335242Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3335719Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3336229Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.3336697Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3337166Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3337636Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3338116Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3338590Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3339067Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3339545Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3339948Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.3340128Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3340264Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3340397Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3340642Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.3342009Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3342135Z graph_break [] 2025-12-04T09:41:43.3342243Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3342420Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3342516Z Autotune Choices Stats: 2025-12-04T09:41:43.3343345Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3343444Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3343532Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3343637Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3344115Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3344592Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3345075Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3345591Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3346066Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3346530Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3347006Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3347500Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3348005Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3348491Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3348821Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.3348921Z Autotune Choices Stats: 2025-12-04T09:41:43.3349811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3349946Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3350037Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3350146Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3350626Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3351096Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3351605Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3352077Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3352546Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3353015Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3353480Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3353960Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3354437Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.3354948Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3355280Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.3355453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3355552Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3355685Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3355935Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3357331Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3357425Z graph_break [] 2025-12-04T09:41:43.3357557Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3357731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3357828Z Autotune Choices Stats: 2025-12-04T09:41:43.3358717Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3358816Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3358947Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3359054Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.3359586Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3360062Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3360531Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3361045Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3361512Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3361989Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3362464Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3362934Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3363406Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3363882Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3364254Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.3364349Z Autotune Choices Stats: 2025-12-04T09:41:43.3365194Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3365293Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3365384Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3365498Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3365975Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3366448Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3366924Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3367407Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3368024Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3368492Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3368998Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3369475Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3369948Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3370473Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3370800Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.3370982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3371077Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3371211Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3371464Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3372844Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3372932Z graph_break [] 2025-12-04T09:41:43.3373038Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.3373217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3373312Z Autotune Choices Stats: 2025-12-04T09:41:43.3374219Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3374319Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3374409Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3374517Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3375003Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3375486Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3375972Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3376447Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3376928Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3377500Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3377976Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3378492Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3378960Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3379428Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3379802Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.3379897Z Autotune Choices Stats: 2025-12-04T09:41:43.3380750Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3380849Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3380939Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3381046Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3381527Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3382001Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3382473Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3382945Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3383448Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3383916Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3384385Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3384859Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3385339Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3385812Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3386147Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.3386322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3386420Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3386599Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.3386844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3388311Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3388396Z graph_break [] 2025-12-04T09:41:43.3388500Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3388678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3388773Z Autotune Choices Stats: 2025-12-04T09:41:43.3389650Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3389747Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3389836Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3389943Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3390421Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3390901Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3391380Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.3391861Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3392385Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3392868Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3393338Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3393811Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3394282Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3394756Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3395089Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.3395186Z Autotune Choices Stats: 2025-12-04T09:41:43.3396024Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:43.3396173Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3396264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3396369Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3396890Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3397365Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3397841Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3398313Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3398820Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3399294Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3399801Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3400416Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3400891Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3401371Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3401700Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.3401942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3402044Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3402173Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3402417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3403357Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3403442Z graph_break [] 2025-12-04T09:41:43.3403547Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3403723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3403813Z Autotune Choices Stats: 2025-12-04T09:41:43.3404663Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.3404755Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3404843Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3404948Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3405492Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3405964Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3406497Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3406972Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3407482Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3407972Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3408516Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3408991Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3409457Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3409929Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3410259Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.3410353Z Autotune Choices Stats: 2025-12-04T09:41:43.3411175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3411313Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3411401Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3411508Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3411979Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3412454Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3412927Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3413395Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3413865Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3414328Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3414841Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3415315Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3415824Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3416296Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3416624Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.3416719Z Autotune Choices Stats: 2025-12-04T09:41:43.3417590Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3417732Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3417820Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3417931Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3418408Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3418880Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3419349Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3419826Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3420307Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3420821Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3421296Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3421781Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3422257Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3422723Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3423055Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.3423226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3423322Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.3423450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3423693Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3425111Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3425236Z graph_break [] 2025-12-04T09:41:43.3425345Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3425515Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3425606Z Autotune Choices Stats: 2025-12-04T09:41:43.3426452Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3426585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3426673Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3426776Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3427264Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3427743Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3428213Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3428688Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3429168Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3429645Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3430156Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3430626Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3431104Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3431578Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3431917Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.3432010Z Autotune Choices Stats: 2025-12-04T09:41:43.3432851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3432951Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3433039Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3433145Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3433681Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3434200Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3434681Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3435144Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3435612Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3436116Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3436587Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3437087Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3437583Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3438057Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3438388Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.3438566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3438663Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3438793Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3439083Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3440522Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3440613Z graph_break [] 2025-12-04T09:41:43.3440723Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3440895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3440992Z Autotune Choices Stats: 2025-12-04T09:41:43.3441835Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.3441931Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3442020Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3442124Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3442606Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3443129Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3443663Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3444141Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3444622Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3445090Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3445600Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3446073Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3446543Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3447012Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3447345Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.3447438Z Autotune Choices Stats: 2025-12-04T09:41:43.3448331Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3448426Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3448514Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3448658Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3449137Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3449608Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3450078Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3450550Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3451021Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3451491Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3451958Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3452475Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3452988Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3453461Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3453797Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.3453969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3454063Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3454197Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3454489Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3455868Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3455958Z graph_break [] 2025-12-04T09:41:43.3456061Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3456235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3456325Z Autotune Choices Stats: 2025-12-04T09:41:43.3457159Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3457253Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3457341Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3457450Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3457964Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3458449Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3458927Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3459407Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3459894Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3460363Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3460833Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3461305Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3461821Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3462326Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3462660Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.3462755Z Autotune Choices Stats: 2025-12-04T09:41:43.3463590Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3463687Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3463816Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3463921Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3464401Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3464877Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3465348Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3465813Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3466285Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3466748Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3467301Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3467777Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3468247Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3468724Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3469053Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.3469226Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.3469324Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3469456Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3469709Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3471121Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3471208Z graph_break [] 2025-12-04T09:41:43.3471313Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3471525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3471623Z Autotune Choices Stats: 2025-12-04T09:41:43.3472454Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3472552Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3472641Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3472743Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3473219Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3473725Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.3474196Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3474664Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3475127Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3475607Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3476076Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3476594Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3477097Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3477600Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3477944Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.3478035Z Autotune Choices Stats: 2025-12-04T09:41:43.3478875Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3478972Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3479058Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3479167Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3479698Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3480175Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3480717Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3481217Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3481692Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3482155Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3482622Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3483128Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3483607Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3484081Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3484410Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.3484586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3484684Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3484819Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3485063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3486041Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3486130Z graph_break [] 2025-12-04T09:41:43.3486234Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3486411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3486502Z Autotune Choices Stats: 2025-12-04T09:41:43.3487330Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3487428Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3487516Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3487621Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3488105Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3488578Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3489059Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3489571Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3490053Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3490574Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3491045Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3491514Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3492030Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3492500Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3492832Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.3492926Z Autotune Choices Stats: 2025-12-04T09:41:43.3493757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3493849Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3493941Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3494045Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3494516Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3494987Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3495490Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3495968Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3496434Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.3496905Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3497396Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3497897Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3498375Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3498887Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3499217Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.3499351Z Autotune Choices Stats: 2025-12-04T09:41:43.3500188Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3500403Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3500492Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3500607Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3501085Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3501635Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3502108Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3502581Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3503058Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3503531Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3504015Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3504495Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3505035Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3505510Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3505837Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:43.3506018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3506112Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3506248Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3506492Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3507916Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3508004Z graph_break [] 2025-12-04T09:41:43.3508106Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3508339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3508435Z Autotune Choices Stats: 2025-12-04T09:41:43.3509277Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3509430Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3509517Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3509621Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3510104Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3510583Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3511096Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3511571Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3512046Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3512516Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3512992Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3513466Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3514060Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3514530Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3514859Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.3514953Z Autotune Choices Stats: 
2025-12-04T09:41:43.3515787Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3515887Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3515976Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3516081Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3516566Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3517042Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3517513Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3518030Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3518535Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3519004Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3519514Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.3519989Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3520510Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3520984Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3521316Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.3521489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3521587Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3521718Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3521964Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3522913Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3522998Z graph_break [] 2025-12-04T09:41:43.3523105Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3523326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3523422Z Autotune Choices Stats: 2025-12-04T09:41:43.3524248Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3524338Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3524427Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3524536Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3525008Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3525490Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3525966Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3526438Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3526982Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3527491Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3528017Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3528482Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3528963Z triton_mm_2225 
0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3529471Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3529801Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.3529901Z Autotune Choices Stats: 2025-12-04T09:41:43.3530734Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3530830Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3530916Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3531019Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3531490Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3531962Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3532430Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3532941Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3533411Z triton_mm_2248 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3533877Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3534343Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3534822Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3535295Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3535771Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3536102Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.3536197Z Autotune Choices Stats: 2025-12-04T09:41:43.3537124Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.3537257Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3537349Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3537459Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3537935Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3538407Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3538924Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3539397Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3539865Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3540336Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3540801Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3541277Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3541752Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3542261Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.3542600Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.3542775Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3542868Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3543004Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3543252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3544632Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3544718Z graph_break [] 2025-12-04T09:41:43.3544822Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3544997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3545091Z Autotune Choices Stats: 2025-12-04T09:41:43.3545972Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3546107Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3546194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3546305Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3546782Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3547276Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3547746Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3548288Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3548764Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3549230Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3549698Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3550162Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3550638Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3551106Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3551477Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.3551574Z Autotune Choices Stats: 2025-12-04T09:41:43.3552409Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3552510Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3552597Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3552702Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3553183Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3553660Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3554129Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3554596Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3555107Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3555573Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3556079Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3556556Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3557053Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3557594Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3557928Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.3558149Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.3558258Z Traceback (most recent call last): 2025-12-04T09:41:43.3558672Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.3558859Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.3559201Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.3559380Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.3559592Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.3559681Z Searched string: 2025-12-04T09:41:43.3559814Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.3559825Z 2025-12-04T09:41:43.3559944Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.3559948Z 2025-12-04T09:41:43.3560076Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.3560202Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 
2025-12-04T09:41:43.3560251Z 2025-12-04T09:41:43.3560345Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.3560434Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.3560530Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3560619Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.3560623Z 2025-12-04T09:41:43.3560709Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.3560797Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.3560888Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3560984Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.3560989Z 2025-12-04T09:41:43.3560995Z 2025-12-04T09:41:43.3561153Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.3561160Z 2025-12-04T09:41:43.3561164Z 2025-12-04T09:41:43.3561284Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.3561399Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.3561510Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.3561598Z idx_m = rm[:, None] 2025-12-04T09:41:43.3561681Z idx_n = rn[None, :] 2025-12-04T09:41:43.3561774Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.3561778Z 2025-12-04T09:41:43.3561883Z # inductor generates a suffix 2025-12-04T09:41:43.3561972Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.3565576Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.3565682Z ''', device_str='cuda') 2025-12-04T09:41:43.3565692Z 2025-12-04T09:41:43.3565696Z 2025-12-04T09:41:43.3565797Z async_compile.wait(globals()) 2025-12-04T09:41:43.3565943Z del async_compile 2025-12-04T09:41:43.3565948Z 2025-12-04T09:41:43.3566036Z class Runner: 2025-12-04T09:41:43.3566178Z def __init__(self, partitions): 2025-12-04T09:41:43.3566283Z self.partitions = partitions 2025-12-04T09:41:43.3566287Z 2025-12-04T09:41:43.3566394Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.3566486Z new_callables = [] 2025-12-04T09:41:43.3566607Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.3566712Z 
new_callables.append(fn(c)) 2025-12-04T09:41:43.3566815Z self.partitions = new_callables 2025-12-04T09:41:43.3566820Z 2025-12-04T09:41:43.3566909Z def call(self, args): 2025-12-04T09:41:43.3566997Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.3567081Z args.clear() 2025-12-04T09:41:43.3567208Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.3567381Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.3567515Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.3567623Z torch.cuda.set_device(0) 2025-12-04T09:41:43.3567801Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.3568025Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.3568120Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.3568314Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.3568396Z del arg0_1 2025-12-04T09:41:43.3568556Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.3568807Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.3568902Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.3569123Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.3569213Z del arg1_1 2025-12-04T09:41:43.3569293Z del buf0 2025-12-04T09:41:43.3569380Z return (buf1, ) 2025-12-04T09:41:43.3569386Z 2025-12-04T09:41:43.3569486Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.3569569Z call = runner.call 2025-12-04T09:41:43.3569732Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.3569736Z 2025-12-04T09:41:43.3569740Z 2025-12-04T09:41:43.3569924Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.3570055Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.3570202Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.3570400Z arg0_1 = rand_strided((256, 256), 
(256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.3570601Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.3570698Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.3570863Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.3570870Z 2025-12-04T09:41:43.3570874Z 2025-12-04T09:41:43.3570964Z if __name__ == "__main__": 2025-12-04T09:41:43.3571166Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.3571324Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.3571407Z From CHECK: .to( 2025-12-04T09:41:43.3571411Z 2025-12-04T09:41:43.3571415Z 2025-12-04T09:41:43.3571589Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.3572143Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.3572148Z 2025-12-04T09:41:43.3572362Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.3572541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3572639Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3572810Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3573059Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3574467Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.3574554Z graph_break [] 2025-12-04T09:41:43.3574658Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3574831Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3574969Z Autotune Choices Stats: 2025-12-04T09:41:43.3575821Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3575920Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3576006Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3576111Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3576598Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3577057Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3577520Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3577980Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3578477Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3578952Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3579408Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3579871Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3580326Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3580795Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3581127Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.3581218Z Autotune Choices Stats: 2025-12-04T09:41:43.3582050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3582401Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3582489Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3582595Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3583136Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3583597Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3584051Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3584509Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3585007Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3585462Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3585921Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3586382Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3586852Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3587367Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3587698Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.3587911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3588007Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3588147Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3588392Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3589333Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3589424Z graph_break [] 2025-12-04T09:41:43.3589528Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3589706Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3589796Z Autotune Choices Stats: 2025-12-04T09:41:43.3590626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3590722Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3590807Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3590913Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3591382Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3591896Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3592403Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3592883Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3593360Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3593843Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3594348Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3594816Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3595279Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3595741Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3596071Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.3596169Z Autotune Choices Stats: 2025-12-04T09:41:43.3597007Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3597114Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3597264Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3597369Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3597837Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3598297Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3598771Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3599237Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3599780Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3600408Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3600869Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3601425Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3601952Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3602433Z 
triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3602771Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.3602862Z Autotune Choices Stats: 2025-12-04T09:41:43.3603693Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.3603849Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3603936Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3604051Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3604529Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3604989Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3605445Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3605917Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3606392Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3606916Z triton_mm_861 0.0276 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3607386Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3607856Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3608341Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3608804Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3609132Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.3609307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3609400Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3609534Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3609780Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3610757Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3610885Z graph_break [] 2025-12-04T09:41:43.3610987Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3611162Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.3611256Z Autotune Choices Stats: 2025-12-04T09:41:43.3612079Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3612174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3612262Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3612405Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3612879Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3613349Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3613814Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3614275Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3614745Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3615217Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3615684Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3616193Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3616652Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3617162Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3617499Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.3617594Z Autotune Choices Stats: 2025-12-04T09:41:43.3618421Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.3618514Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3618603Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3618706Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3619176Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3619715Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3620175Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3620685Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3621144Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3621605Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3622102Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3622571Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3623041Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3623509Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3623844Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.3623937Z Autotune Choices Stats: 2025-12-04T09:41:43.3624775Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3624869Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3624954Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3625105Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3625583Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3626057Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3626537Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3627001Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3627471Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3627975Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3628437Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3628939Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3629399Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3629909Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3630238Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.3630413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3630505Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3630635Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3630918Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3632297Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3632386Z graph_break [] 2025-12-04T09:41:43.3632488Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3632659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3632752Z Autotune Choices Stats: 2025-12-04T09:41:43.3633586Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3633684Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.3633772Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3633875Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3634403Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3634864Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3635325Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3635789Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3636252Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3636724Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3637193Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3637665Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3638172Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3638638Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3639005Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.3639100Z Autotune Choices Stats: 2025-12-04T09:41:43.3639973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3640065Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3640201Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3640304Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3640784Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3641256Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3641731Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3642195Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3642656Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3643123Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3643592Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3644102Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3644582Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3645055Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3645385Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.3645560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3645653Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3645785Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3646039Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3646979Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3647062Z graph_break [] 2025-12-04T09:41:43.3647169Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3647406Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3647510Z Autotune Choices Stats: 2025-12-04T09:41:43.3648406Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.3648500Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3648586Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3648692Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3649168Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3649643Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3650154Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3650628Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3651095Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3651560Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3652031Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3652503Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3653043Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3653516Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3653845Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.3653940Z Autotune Choices Stats: 2025-12-04T09:41:43.3654769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3654867Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3654953Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3655056Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3655532Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3656004Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3656477Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3656985Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3657522Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3658007Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3658472Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3658947Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3659522Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3659999Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3660324Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.3660415Z Autotune Choices Stats: 2025-12-04T09:41:43.3661254Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3661351Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3661439Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3661549Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3662033Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3662551Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3663016Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3663482Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3663958Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3664425Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3664898Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3665368Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3665841Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3666352Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3666724Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.3666899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3666993Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3667128Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3667370Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3668313Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3668437Z graph_break [] 2025-12-04T09:41:43.3668540Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3668718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3668809Z Autotune Choices Stats: 2025-12-04T09:41:43.3669635Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3669730Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3669815Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3669921Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.3670392Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3670868Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3671342Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3671857Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3672336Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3672813Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3673291Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3673773Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3674237Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3674709Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3675040Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.3675174Z Autotune Choices Stats: 2025-12-04T09:41:43.3676003Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3676138Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3676224Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3676327Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3676799Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3677325Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3677840Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3678311Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3678783Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3679249Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3679789Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3680268Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3680742Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3681253Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3681583Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.3681673Z Autotune Choices Stats: 2025-12-04T09:41:43.3682513Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3682613Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3682701Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3682810Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3683283Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3683756Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3684226Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3684746Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3685258Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3685736Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3686221Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3686686Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3687276Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3687745Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3688079Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.3688252Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.3688344Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3688475Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3688719Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3690090Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3690175Z graph_break [] 2025-12-04T09:41:43.3690318Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3690495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3690585Z Autotune Choices Stats: 2025-12-04T09:41:43.3691421Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3691515Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3691610Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3691717Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3692189Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3692664Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3693138Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3693616Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3694137Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3694653Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3695139Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3695622Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3696102Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3696617Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3696953Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.3697049Z Autotune Choices Stats: 2025-12-04T09:41:43.3697890Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3697987Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3698072Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3698175Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3698657Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3699133Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3699644Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3700113Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3700743Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3701220Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3701687Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3702166Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3702639Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3703113Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3703515Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.3703690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3703847Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3703976Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3704226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3705634Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3705772Z graph_break [] 2025-12-04T09:41:43.3705883Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3706056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3706149Z Autotune Choices Stats: 2025-12-04T09:41:43.3707017Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3707124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3707225Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3707330Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3707815Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3708299Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3708770Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3709292Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3709759Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3710228Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3710706Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3711173Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.3711648Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3712117Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3712449Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.3712543Z Autotune Choices Stats: 2025-12-04T09:41:43.3713423Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3713553Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3713640Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3713748Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3714228Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3714702Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3715218Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3715688Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3716159Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3716623Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3717131Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3717621Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3718093Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3718603Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3718932Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.3719108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3719202Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3719332Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3719633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3721007Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3721098Z graph_break [] 2025-12-04T09:41:43.3721200Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3721371Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3721467Z Autotune Choices Stats: 2025-12-04T09:41:43.3722341Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.3722441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3722590Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3722694Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3723177Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3723650Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3724122Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3724636Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3725110Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3725592Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3726066Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3726536Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3727005Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3727473Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3727840Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.3727935Z Autotune Choices Stats: 2025-12-04T09:41:43.3728773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3728866Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3728958Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3729064Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3729541Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3730022Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3730492Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3730965Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3731479Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3731950Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3732457Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3732929Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3733403Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3733919Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3734253Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.3734427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3734522Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3734657Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3734903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3736291Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3736376Z graph_break [] 2025-12-04T09:41:43.3736479Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3736657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3736747Z Autotune Choices Stats: 2025-12-04T09:41:43.3737672Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3737768Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3737855Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3737962Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3738438Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3738922Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3739396Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3739873Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3740354Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3740869Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3741356Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3741882Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3742354Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3742818Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3743194Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.3743289Z 
Autotune Choices Stats: 2025-12-04T09:41:43.3744128Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3744228Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3744314Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3744419Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3744894Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3745359Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3745830Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3746304Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3746815Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3747284Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3747800Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3748276Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3748752Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3749230Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3749559Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.3749730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3749829Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3749996Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3750251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3751670Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3751757Z graph_break [] 2025-12-04T09:41:43.3751861Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3752033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3752126Z Autotune Choices Stats: 2025-12-04T09:41:43.3753009Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3753104Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3753191Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3753295Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3753781Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3754252Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3754722Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3755197Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3755704Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3756176Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3756647Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3757127Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3757599Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3758073Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3758406Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.3758495Z Autotune Choices Stats: 2025-12-04T09:41:43.3759341Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3759551Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3759638Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3759747Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3760262Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3760739Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3761209Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3761670Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3762183Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3762651Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3763119Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3763589Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3764060Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3764536Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3764865Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.3765079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3765178Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3765313Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3765558Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3766497Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3766585Z graph_break [] 2025-12-04T09:41:43.3766688Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3766867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3766958Z Autotune Choices Stats: 2025-12-04T09:41:43.3767841Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3767937Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3768021Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3768123Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3768641Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3769113Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3769625Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3770091Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3770570Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3771043Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3771555Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3772042Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3772520Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3772997Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3773330Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.3773430Z Autotune Choices Stats: 2025-12-04T09:41:43.3774264Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3774399Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3774494Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3774595Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3775062Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.3775534Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3776008Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3776480Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3776948Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3777419Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3777935Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3778451Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3778962Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3779434Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3779766Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.3779856Z Autotune Choices Stats: 2025-12-04T09:41:43.3780700Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3780844Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3780930Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3781042Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3781521Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3781996Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3782471Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3782956Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3783442Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3783960Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3784442Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3784924Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3785411Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3785896Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3786230Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.3786409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3786504Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3786637Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3786876Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3787862Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3787986Z graph_break [] 2025-12-04T09:41:43.3788090Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3788267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3788360Z Autotune Choices Stats: 2025-12-04T09:41:43.3789189Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3789283Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3789409Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3789512Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3789990Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3790470Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3790958Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3791438Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3791915Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3792389Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3792921Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3793392Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3793863Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3794342Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3794678Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.3794772Z Autotune Choices Stats: 2025-12-04T09:41:43.3795605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3795700Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3795790Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3795896Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3796365Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3796880Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3797396Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3797907Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.3798373Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3798842Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3799358Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3799888Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3800498Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3800970Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3801304Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.3801399Z Autotune Choices Stats: 2025-12-04T09:41:43.3802251Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3802345Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3802500Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3802616Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:43.3803089Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3803559Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3804040Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3804524Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3805007Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3805482Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3805968Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3806506Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3807087Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3807606Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3807936Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.3808112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3808204Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3808389Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3808635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3809572Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3809667Z graph_break [] 2025-12-04T09:41:43.3809771Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3809944Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3810039Z Autotune Choices Stats: 2025-12-04T09:41:43.3810873Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3810973Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3811058Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3811167Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3811652Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3812169Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3812651Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3813127Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3813615Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3814094Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3814561Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3815036Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3815543Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3816019Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3816387Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:43.3816483Z Autotune Choices Stats: 2025-12-04T09:41:43.3817317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3817409Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3817501Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3817645Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3818120Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3818599Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3819066Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3819534Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3819999Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3820471Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3820939Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3821449Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3821926Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3822396Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3822735Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.3822829Z Autotune Choices Stats: 2025-12-04T09:41:43.3823667Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3823767Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3823852Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3823962Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3824439Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3824949Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3825458Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.3825926Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3826397Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3826863Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3827457Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3827931Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3828403Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3828881Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3829208Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.3829385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3829480Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3829610Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3829859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.3831273Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3834731Z graph_break [] 2025-12-04T09:41:43.3834854Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3835054Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3835159Z Autotune Choices Stats: 2025-12-04T09:41:43.3836160Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3836263Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3836354Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3836463Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3837029Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3837639Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3838272Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3838787Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3839256Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3839788Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3840260Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3840775Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3841248Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3841726Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3842053Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.3842143Z Autotune Choices Stats: 2025-12-04T09:41:43.3842979Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3843073Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3843163Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3843267Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3843786Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3844258Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3844726Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3845205Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3845670Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3846145Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3846612Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3847084Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3847612Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.3848119Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3848455Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.3848629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3848721Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3848852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3849097Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3850480Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3850604Z graph_break [] 2025-12-04T09:41:43.3850705Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3850884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3850974Z Autotune Choices Stats: 2025-12-04T09:41:43.3851826Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.3851922Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3852006Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3852115Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.3852594Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3853116Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3853584Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3854051Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3854524Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3854995Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3855478Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3855948Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3856418Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3856930Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3857306Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.3857422Z Autotune Choices Stats: 2025-12-04T09:41:43.3858287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3858381Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3858467Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3858569Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3859047Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3859553Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3860026Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3860500Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3860963Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3861434Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3861900Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3862414Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3862889Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3863363Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3863693Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.3863867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3863965Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3864094Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3864340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3865725Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3865807Z graph_break [] 2025-12-04T09:41:43.3865913Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.3866123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3866217Z Autotune Choices Stats: 2025-12-04T09:41:43.3867151Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3867256Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3867353Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3867456Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3867935Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3868414Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3868935Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3869418Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3869896Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3870374Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3870848Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3871316Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3871822Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3872287Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3872619Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.3872711Z Autotune Choices Stats: 2025-12-04T09:41:43.3873565Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3873659Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3873743Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3873849Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3874324Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3874794Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3875258Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3875763Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3876267Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3876733Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3877200Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3877672Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3878187Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3878662Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3878991Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.3879164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3879258Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3879390Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.3879715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3881089Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3881220Z graph_break [] 2025-12-04T09:41:43.3881327Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3881503Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3881592Z Autotune Choices Stats: 2025-12-04T09:41:43.3882422Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3882522Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3882606Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3882712Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3883184Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3883667Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3884143Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.3884617Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3885159Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3885680Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3886145Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3886615Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3887081Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3887639Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3887973Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.3888067Z Autotune Choices Stats: 2025-12-04T09:41:43.3888894Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:43.3888986Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3889073Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3889178Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3889661Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3890134Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3890643Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3891117Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3891584Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3892062Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3892529Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3893004Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3893475Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3893948Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3894321Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.3894531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3894626Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3894756Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3895002Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3895948Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3896029Z graph_break [] 2025-12-04T09:41:43.3896170Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3896347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3896436Z Autotune Choices Stats: 2025-12-04T09:41:43.3897306Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.3897415Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3897508Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3897615Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3898094Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3898565Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3899033Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3899546Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3900016Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3900652Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3901127Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3901602Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3902075Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3902546Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3902877Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.3902970Z Autotune Choices Stats: 2025-12-04T09:41:43.3903892Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.3904038Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3904123Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3904226Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3904701Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3905177Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3905644Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3906165Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3906633Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3907100Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.3907563Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3908040Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3908512Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3909040Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3909370Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.3909461Z Autotune Choices Stats: 2025-12-04T09:41:43.3910290Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3910385Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3910475Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3910584Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3911059Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3911537Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3912009Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3912480Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3912998Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3913512Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3913993Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3914476Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3914950Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3915457Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3915792Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.3915966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3916060Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.3916193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3916438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3917870Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3917957Z graph_break [] 2025-12-04T09:41:43.3918062Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3918233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3918323Z Autotune Choices Stats: 2025-12-04T09:41:43.3919206Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.3919299Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3919383Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3919570Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3920060Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3920537Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3921011Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3921480Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3921959Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3922468Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3922980Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3923451Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3923926Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3924397Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3924770Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.3924865Z Autotune Choices Stats: 2025-12-04T09:41:43.3925707Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3925801Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3925885Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3925987Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3926466Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3926944Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3927435Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3927972Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3928442Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3928906Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3929375Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3929854Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3930327Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3930801Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3931127Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.3931300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3931434Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3931565Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3931852Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3933235Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3933320Z graph_break [] 2025-12-04T09:41:43.3933421Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3933593Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3933725Z Autotune Choices Stats: 2025-12-04T09:41:43.3934564Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.3934658Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3934751Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3934854Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3935336Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3935812Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3936283Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3936762Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3937305Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.3937776Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3938241Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3938713Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3939180Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3939651Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3939985Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.3940076Z Autotune Choices Stats: 2025-12-04T09:41:43.3940956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3941051Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3941173Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3941277Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3941758Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3942228Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3942694Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3943203Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3943672Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3944140Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3944606Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3945076Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3945558Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3946029Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3946396Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.3946572Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3946664Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3946796Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3947040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3948462Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3948552Z graph_break [] 2025-12-04T09:41:43.3948657Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3948835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3948925Z Autotune Choices Stats: 2025-12-04T09:41:43.3949750Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3949845Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3949930Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3950076Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3950551Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3951070Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3951548Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3952023Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3952548Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3953012Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3953483Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3953953Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3954424Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3954899Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3955229Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.3955322Z Autotune Choices Stats: 2025-12-04T09:41:43.3956182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3956274Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3956362Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3956465Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3956943Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3957469Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3957938Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3958405Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3958869Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3959432Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3959950Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3960470Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3960941Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3961411Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3961785Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.3961959Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.3962058Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3962187Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3962430Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3963815Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3963896Z graph_break [] 2025-12-04T09:41:43.3964005Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3964178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3964268Z Autotune Choices Stats: 2025-12-04T09:41:43.3965160Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.3965254Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3965342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3965443Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3965916Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3966386Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.3966856Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3967329Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3967795Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3968270Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3968783Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3969257Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3969772Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3970248Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3970580Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.3970670Z Autotune Choices Stats: 2025-12-04T09:41:43.3971578Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3971675Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3971759Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3971862Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3972340Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3972812Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3973291Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3973762Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3974236Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3974741Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3975211Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3975686Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3976159Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3976644Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3976975Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.3977155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3977250Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3977410Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3977686Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.3978664Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.3978788Z graph_break [] 2025-12-04T09:41:43.3978891Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.3979069Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.3979164Z Autotune Choices Stats: 2025-12-04T09:41:43.3979992Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3980126Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3980211Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3980316Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.3980803Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3981278Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3981753Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3982221Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3982705Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3983192Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3983705Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3984175Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3984644Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3985116Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3985450Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.3985542Z Autotune Choices Stats: 2025-12-04T09:41:43.3986379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3986470Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3986561Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3986666Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.3987137Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.3987697Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3988202Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.3988682Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3989147Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.3989615Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3990131Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3990614Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3991089Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3991560Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3991894Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.3991987Z Autotune Choices Stats: 2025-12-04T09:41:43.3992808Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.3992946Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.3993033Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.3993148Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.3993622Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.3994093Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3994570Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.3995042Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.3995525Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3996000Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3996477Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3997007Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.3997525Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3998007Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.3998338Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:43.3998512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.3998647Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.3998778Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.3999030Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4000758Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4000845Z graph_break [] 2025-12-04T09:41:43.4000947Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4001119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4001212Z Autotune Choices Stats: 2025-12-04T09:41:43.4002050Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4002145Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4002230Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4002334Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4002881Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4003357Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4003829Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4004306Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4004780Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4005258Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4005732Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4006257Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4006736Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4007345Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4007676Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.4007766Z Autotune Choices Stats: 
2025-12-04T09:41:43.4008599Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4008744Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4008838Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4008940Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4009421Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4009897Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4010368Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4010842Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4011315Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4011781Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4012298Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.4012771Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4013247Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4013723Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4014057Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.4014234Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4014328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4014459Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4014701Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4015641Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4015766Z graph_break [] 2025-12-04T09:41:43.4015871Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4016085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4016175Z Autotune Choices Stats: 2025-12-04T09:41:43.4017026Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4017137Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4017233Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4017344Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4017816Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4018336Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4018813Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4019288Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4019770Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4020248Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4020735Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4021240Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4021720Z triton_mm_2225 
0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4022191Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4022521Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.4022618Z Autotune Choices Stats: 2025-12-04T09:41:43.4023443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4023538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4023631Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4023734Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4024209Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4024677Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4025185Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4025696Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4026165Z triton_mm_2248 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4026636Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4027100Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4027619Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4028093Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4028563Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4028894Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.4028984Z Autotune Choices Stats: 2025-12-04T09:41:43.4029820Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.4029916Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4030002Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4030114Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4030626Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4031099Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4031570Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4032048Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4032515Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4032980Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4033451Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4033922Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4034446Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4034953Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.4035289Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.4035463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4035559Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4035692Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4035934Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4037378Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4037477Z graph_break [] 2025-12-04T09:41:43.4037594Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4037768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4037861Z Autotune Choices Stats: 2025-12-04T09:41:43.4038690Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4038788Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4038875Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4038982Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4039457Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.4040038Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4040513Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4040983Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4041462Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4041928Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4042396Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4042862Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4043334Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4043867Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4044235Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.4044333Z Autotune Choices Stats: 2025-12-04T09:41:43.4045174Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4045266Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4045354Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4045457Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4045977Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4046450Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4046925Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4047424Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4047915Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4048390Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4048855Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4049373Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4049845Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4050318Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4050651Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.4050822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4050924Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4051053Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4051294Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4052677Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4052760Z graph_break [] 2025-12-04T09:41:43.4052871Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4053084Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4053178Z Autotune Choices Stats: 2025-12-04T09:41:43.4054065Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4054159Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4054249Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4054354Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4054826Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4055350Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4055829Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4056305Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4056775Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4057247Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4057726Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4058192Z 
triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4058710Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4059187Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4059519Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.4059613Z Autotune Choices Stats: 2025-12-04T09:41:43.4060451Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4060552Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4060638Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4060744Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4061225Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4061700Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4062208Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4062679Z triton_mm_2374 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4063192Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4063659Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4064128Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4064642Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4065114Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4065592Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4065919Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.4066145Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.4066247Z Traceback (most recent call last): 2025-12-04T09:41:43.4066664Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.4066857Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.4067205Z File 
"/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.4067406Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.4067591Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.4067684Z Searched string: 2025-12-04T09:41:43.4067860Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.4067866Z 2025-12-04T09:41:43.4067985Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.4067989Z 2025-12-04T09:41:43.4068116Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.4068244Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.4068248Z 2025-12-04T09:41:43.4068343Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.4068440Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.4068531Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4068621Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.4068626Z 2025-12-04T09:41:43.4068720Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.4068812Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.4068902Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4068995Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.4068999Z 2025-12-04T09:41:43.4069003Z 2025-12-04T09:41:43.4069164Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.4069168Z 2025-12-04T09:41:43.4069172Z 2025-12-04T09:41:43.4069295Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.4069407Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.4069518Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.4069603Z idx_m = rm[:, None] 2025-12-04T09:41:43.4069687Z idx_n = rn[None, :] 2025-12-04T09:41:43.4069786Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.4069790Z 2025-12-04T09:41:43.4069927Z # inductor generates a suffix 2025-12-04T09:41:43.4070022Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4070247Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, 
[BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.4070374Z ''', device_str='cuda') 2025-12-04T09:41:43.4070379Z 2025-12-04T09:41:43.4070382Z 2025-12-04T09:41:43.4070481Z async_compile.wait(globals()) 2025-12-04T09:41:43.4070565Z del async_compile 2025-12-04T09:41:43.4070572Z 2025-12-04T09:41:43.4070651Z class Runner: 2025-12-04T09:41:43.4070758Z def __init__(self, partitions): 2025-12-04T09:41:43.4070860Z self.partitions = partitions 2025-12-04T09:41:43.4070864Z 2025-12-04T09:41:43.4070972Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.4071068Z new_callables = [] 2025-12-04T09:41:43.4071183Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.4071286Z new_callables.append(fn(c)) 2025-12-04T09:41:43.4071432Z self.partitions = new_callables 2025-12-04T09:41:43.4071436Z 2025-12-04T09:41:43.4071526Z def call(self, args): 2025-12-04T09:41:43.4071618Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.4071699Z args.clear() 2025-12-04T09:41:43.4071828Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.4071958Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.4072068Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.4072166Z torch.cuda.set_device(0) 2025-12-04T09:41:43.4072340Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.4072558Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.4072654Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.4072851Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.4072935Z del arg0_1 2025-12-04T09:41:43.4073102Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.4073357Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.4073455Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.4073675Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 
2025-12-04T09:41:43.4073757Z del arg1_1 2025-12-04T09:41:43.4073834Z del buf0 2025-12-04T09:41:43.4073988Z return (buf1, ) 2025-12-04T09:41:43.4073993Z 2025-12-04T09:41:43.4074096Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.4074187Z call = runner.call 2025-12-04T09:41:43.4074344Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.4074349Z 2025-12-04T09:41:43.4074352Z 2025-12-04T09:41:43.4074489Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.4074623Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.4074772Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.4074977Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.4075180Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.4075281Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.4075446Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.4075450Z 2025-12-04T09:41:43.4075454Z 2025-12-04T09:41:43.4075542Z if __name__ == "__main__": 2025-12-04T09:41:43.4075745Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.4075906Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.4075987Z From CHECK: .to( 2025-12-04T09:41:43.4075991Z 2025-12-04T09:41:43.4075995Z 2025-12-04T09:41:43.4076169Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.4076757Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.4076765Z 2025-12-04T09:41:43.4076987Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.4077201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4077294Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.4077436Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4077724Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4079097Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4079228Z graph_break [] 2025-12-04T09:41:43.4079332Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4079561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4079653Z Autotune Choices Stats: 2025-12-04T09:41:43.4080498Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4080594Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4080680Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4080784Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4081274Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4081741Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4082213Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4082713Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4083175Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4083645Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4084108Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4084570Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4085028Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4085493Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4085822Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.4085915Z Autotune Choices Stats: 2025-12-04T09:41:43.4086790Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4086919Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4087012Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4087115Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4087597Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4088058Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4088551Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4089019Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4089480Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4089942Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4090397Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4090869Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4091338Z triton_mm_55 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4091839Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4092173Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.4092347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4092439Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4092572Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4092825Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4093773Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4097318Z graph_break [] 2025-12-04T09:41:43.4097447Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4097636Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4097728Z Autotune Choices Stats: 2025-12-04T09:41:43.4098564Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4098727Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4098819Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4098931Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.4099452Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4099928Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4100561Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4101041Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4101602Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4102084Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4102556Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4103027Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4103499Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4103966Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4104303Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.4104400Z Autotune Choices Stats: 2025-12-04T09:41:43.4105342Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4105441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4105530Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4105637Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4106116Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4106581Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4107104Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4107572Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4108036Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4108561Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4109076Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4109550Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4110020Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4110497Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4110897Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.4110992Z Autotune Choices Stats: 2025-12-04T09:41:43.4111834Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.4111932Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4112025Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4112137Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4112614Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4113084Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.4113549Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4114065Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4114543Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4115015Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4115487Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4115956Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4116429Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4116891Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4117228Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.4117403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4117501Z frames [('total', 1), ('ok', 
1)] 2025-12-04T09:41:43.4117680Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4118062Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4119053Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4119139Z graph_break [] 2025-12-04T09:41:43.4119243Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4119422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4119585Z Autotune Choices Stats: 2025-12-04T09:41:43.4120440Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4120586Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4120674Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4120781Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4121263Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4121730Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4122198Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4122667Z triton_mm_884 
0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4123142Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4123652Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4124123Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4124593Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4125059Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4125527Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4125860Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.4125958Z Autotune Choices Stats: 2025-12-04T09:41:43.4126778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.4126879Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.4126967Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4127111Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4127637Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4128147Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4128611Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4129076Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4129579Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4130046Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4130514Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4130990Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4131459Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.4131934Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4132264Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.4132361Z Autotune Choices Stats: 2025-12-04T09:41:43.4133234Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4133330Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4133416Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4133530Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4134007Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4134491Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4134969Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4135436Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4135899Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4136401Z 
triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4136867Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4137393Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4137888Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4138359Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4138696Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.4138910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4139004Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4139143Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4139391Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4140783Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.4140872Z graph_break [] 2025-12-04T09:41:43.4140976Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4141157Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4141251Z Autotune Choices Stats: 2025-12-04T09:41:43.4142083Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4142226Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4142315Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4142422Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4142908Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4143369Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4143843Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4144306Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4144774Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4145242Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4145757Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4146233Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4146767Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4147233Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4147566Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.4147664Z Autotune Choices Stats: 2025-12-04T09:41:43.4148502Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4148637Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4148727Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4148833Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4149314Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4149779Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.4150253Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4150724Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4151191Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4151697Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4152167Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4152642Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4153117Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4153592Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4153929Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.4154104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4154202Z frames [('total', 1), ('ok', 
1)] 2025-12-04T09:41:43.4154333Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4154580Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4155564Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4155685Z graph_break [] 2025-12-04T09:41:43.4155796Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4155976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4156070Z Autotune Choices Stats: 2025-12-04T09:41:43.4156909Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.4157004Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4157132Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4157241Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4157773Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4158254Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4158727Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4159199Z 
triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4159709Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4160181Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4160653Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4161167Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4161645Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4162119Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4162458Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.4162555Z Autotune Choices Stats: 2025-12-04T09:41:43.4163386Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4163484Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.4163570Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4163677Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4164151Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4164669Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4165183Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4165652Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4166123Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4166589Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4167106Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4167614Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4168114Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4168591Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4168922Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.4169020Z Autotune Choices Stats: 2025-12-04T09:41:43.4169858Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4169956Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4170085Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4170197Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4170685Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4171161Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4171633Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4172108Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4172583Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.4173052Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4173518Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4174035Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4174544Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4175019Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4175353Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.4175527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4175625Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4175797Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4176046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4176989Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4177074Z graph_break [] 2025-12-04T09:41:43.4177185Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4177357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4177449Z Autotune Choices Stats: 2025-12-04T09:41:43.4178302Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4178400Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4178486Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4178593Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4179068Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4179586Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4180061Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4180543Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4181025Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4181507Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.4181981Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4182461Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4183002Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4183481Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4183854Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.4183946Z Autotune Choices Stats: 2025-12-04T09:41:43.4184775Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4184872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4184960Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4185067Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4185587Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4186068Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4186546Z 
triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4187018Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4187518Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4188013Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4188482Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4188998Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4189473Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4189948Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4190286Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.4190383Z Autotune Choices Stats: 2025-12-04T09:41:43.4191214Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4191307Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4191398Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4191509Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4191983Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4192499Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4192975Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4193498Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4193974Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4194454Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4194981Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4195454Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4195931Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4196399Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4196736Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.4196912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4197012Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4197145Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4197393Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4198872Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4198955Z graph_break [] 2025-12-04T09:41:43.4199062Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4199236Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4199333Z Autotune Choices Stats: 2025-12-04T09:41:43.4200228Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4200451Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4200539Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4200648Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4201125Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4201602Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4202148Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4202685Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4203168Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4203647Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4204129Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4204670Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4205156Z triton_mm_1193 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4205632Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4205967Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.4206060Z Autotune Choices Stats: 2025-12-04T09:41:43.4206909Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4207009Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4207096Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4207202Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4207741Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4208217Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4208689Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4209163Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4209634Z triton_mm_1216 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4210110Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4210576Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4211051Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4211566Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4212079Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4212409Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.4212586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4212680Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4212812Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4213062Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4214448Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 
5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4214578Z graph_break [] 2025-12-04T09:41:43.4214683Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4214860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4214956Z Autotune Choices Stats: 2025-12-04T09:41:43.4215801Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4215898Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4215985Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4216092Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4216581Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4217119Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4217624Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4218112Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4218583Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.4219051Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4219528Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4219998Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4220469Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4220980Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4221358Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.4221452Z Autotune Choices Stats: 2025-12-04T09:41:43.4222304Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4222397Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4222482Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4222588Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4223070Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.4223586Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4224064Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4224538Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4225004Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4225474Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4225945Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4226460Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4226937Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4227409Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4227766Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.4227969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4228068Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4228203Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4228448Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4229822Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4229908Z graph_break [] 2025-12-04T09:41:43.4230014Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4230231Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4230324Z Autotune Choices Stats: 2025-12-04T09:41:43.4231210Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.4231307Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4231394Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4231502Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4231978Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4232454Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4232966Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4233447Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4233925Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4234403Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4234887Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4235355Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4235863Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4236335Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4236677Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.4236774Z Autotune Choices Stats: 2025-12-04T09:41:43.4237620Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4237719Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4237806Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4237915Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4238403Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4238880Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4239353Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4239923Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4240430Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4240907Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4241375Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4241860Z 
triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4242370Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4242847Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4243183Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.4243358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4243458Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4243591Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4243838Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4245228Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4245354Z graph_break [] 2025-12-04T09:41:43.4245464Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4245638Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4245731Z Autotune Choices Stats: 2025-12-04T09:41:43.4246567Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4246667Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4246759Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4246872Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4247372Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4247878Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4248352Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4248839Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4249360Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4249882Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4250368Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4250845Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4251323Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4251907Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4252248Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.4252343Z Autotune Choices Stats: 2025-12-04T09:41:43.4253188Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4253292Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4253379Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4253492Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4253973Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4254445Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4254965Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4255439Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4255910Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4256382Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4256854Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4257328Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4257850Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4258328Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4258701Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.4258917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4259014Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4259147Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4259400Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4260774Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4260901Z graph_break [] 2025-12-04T09:41:43.4261009Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4261186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4261289Z Autotune Choices Stats: 2025-12-04T09:41:43.4262137Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4262236Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4262324Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4262430Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4262919Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4263398Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4263876Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4264391Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4264863Z triton_mm_1357 0.0276 ms 
96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4265333Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4265813Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4266294Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4266777Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4267256Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4267590Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.4267688Z Autotune Choices Stats: 2025-12-04T09:41:43.4268569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4268702Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4268794Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4268900Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4269390Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4269880Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4270404Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4270879Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4271348Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4271823Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4272290Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4272770Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4273249Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4273762Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.4274098Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.4274274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4274371Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4274511Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4274759Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4275710Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4275798Z graph_break [] 2025-12-04T09:41:43.4275904Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4276083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4276177Z Autotune Choices Stats: 2025-12-04T09:41:43.4277014Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4277153Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4277265Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4277381Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4277918Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4278392Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4278863Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4279330Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4279913Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4280389Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4280869Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4281347Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4281828Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4282302Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4282637Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.4282734Z Autotune Choices Stats: 2025-12-04T09:41:43.4283601Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4283702Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4283791Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4283899Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4284379Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4284856Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4285333Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4285802Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4286269Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4286788Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4287329Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4287852Z triton_mm_1432 0.0338 ms 78.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4288328Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4288805Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4289178Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.4289276Z Autotune Choices Stats: 2025-12-04T09:41:43.4290116Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4290211Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4290301Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4290413Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4290897Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4291389Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4291865Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4292391Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4292870Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4293350Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4293840Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4294319Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4294808Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4295288Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4295624Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.4295803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4295947Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4296088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4296375Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4297372Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4297456Z graph_break [] 2025-12-04T09:41:43.4297564Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4297741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4297834Z Autotune Choices Stats: 2025-12-04T09:41:43.4298667Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4298810Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4298899Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4299009Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4299490Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4299970Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4300609Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4301096Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4301574Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4302115Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4302589Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4303058Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4303530Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4304007Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4304346Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.4304445Z Autotune Choices Stats: 2025-12-04T09:41:43.4305272Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4305379Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4305527Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4305637Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4306172Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4306649Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4307121Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4307644Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4308727Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4309202Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4309672Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4310152Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4310631Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4311119Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4311451Z SingleProcess 
AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.4311545Z Autotune Choices Stats: 2025-12-04T09:41:43.4312428Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4312525Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4312615Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4312733Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4313214Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4313694Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4314171Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4314654Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4315132Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4315654Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4316182Z triton_mm_1550 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4316663Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4317142Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4317639Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4318043Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.4318221Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4318320Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4318460Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4318712Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4319698Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4319789Z graph_break [] 2025-12-04T09:41:43.4319899Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4320088Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4320184Z Autotune Choices Stats: 2025-12-04T09:41:43.4321037Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4321138Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4321270Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4321386Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4321868Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4322346Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4322835Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4323319Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4323815Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4324292Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4324767Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4325308Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4325816Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4326298Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4326631Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.4326733Z Autotune Choices Stats: 2025-12-04T09:41:43.4327572Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4327709Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4327804Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4327911Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4328390Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4328866Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4329337Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4329824Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4330294Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4330810Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4331281Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4331762Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4332243Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4332721Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4333060Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.4333156Z Autotune Choices Stats: 2025-12-04T09:41:43.4334005Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4334107Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4334198Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4334363Z dtypes: 
torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4334847Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4335365Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4335835Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4336306Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4336829Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4337323Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4337831Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4338308Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4338790Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4339274Z triton_mm_1637 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4339605Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.4339788Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4339886Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4340067Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4340317Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4341707Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4341799Z graph_break [] 2025-12-04T09:41:43.4341911Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4342093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4342191Z Autotune Choices Stats: 2025-12-04T09:41:43.4343027Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4343130Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4343220Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4343333Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4343852Z triton_mm_1656 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4344340Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4344882Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4345360Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4345835Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4346345Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4346827Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4347312Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4347827Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4348302Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4348643Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.4348743Z Autotune Choices Stats: 2025-12-04T09:41:43.4349620Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4349720Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4349813Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4349920Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4350412Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4350885Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4351357Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4351838Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4352307Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4352783Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.4353291Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4353776Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4354292Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4354765Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4355102Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.4355279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4355422Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4355562Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4355812Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4357209Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4357294Z graph_break [] 2025-12-04T09:41:43.4357405Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4357580Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4357676Z Autotune Choices Stats: 2025-12-04T09:41:43.4358541Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4358640Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4358731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4358838Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4359385Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4359914Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4360392Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4360871Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4361349Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4361825Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4362303Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4362822Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4363298Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4363815Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4364154Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.4364250Z Autotune Choices Stats: 2025-12-04T09:41:43.4365079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4365221Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4365310Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4365419Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4365897Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4366368Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4366843Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4367327Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4367851Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4368323Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4368837Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4369315Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4369793Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4370273Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4370608Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.4370790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4370889Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4371023Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4371276Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4372698Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4372830Z graph_break [] 2025-12-04T09:41:43.4372936Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4373113Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4373217Z Autotune Choices Stats: 2025-12-04T09:41:43.4374057Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4374155Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4374244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4374389Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4374884Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4375366Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4375853Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.4376329Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4380568Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4381074Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4381550Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4382089Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4382558Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4383031Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4383365Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.4383460Z Autotune Choices Stats: 2025-12-04T09:41:43.4384306Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.4384399Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4384489Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4384594Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4385070Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4385588Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4386096Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4386569Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4387036Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4387501Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4388043Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4388522Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4389001Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4389474Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4389807Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.4389988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4390084Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4390222Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4390471Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4391892Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4391977Z graph_break [] 2025-12-04T09:41:43.4392082Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4392265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4392358Z Autotune Choices Stats: 2025-12-04T09:41:43.4393193Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4393291Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4393380Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.4393490Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4393968Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4394454Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4394975Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4395492Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4395979Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4396459Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4396933Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4397519Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4397996Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4398463Z 
triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4398797Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.4398895Z Autotune Choices Stats: 2025-12-04T09:41:43.4399803Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4399906Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4399992Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4400096Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4400829Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4401306Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4401786Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4402267Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4402741Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4403224Z triton_mm_1817 0.0307 ms 93.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4403690Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4404168Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4404706Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4405274Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4405604Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.4405779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4405879Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4406010Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4406255Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4407262Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4407348Z graph_break [] 2025-12-04T09:41:43.4407461Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4407664Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.4407773Z Autotune Choices Stats: 2025-12-04T09:41:43.4408618Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.4408709Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4408797Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4408904Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4409391Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4409868Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4410382Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4410866Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4411331Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4411819Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4412288Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4412760Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4413230Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4413744Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4414083Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.4414215Z Autotune Choices Stats: 2025-12-04T09:41:43.4415062Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4415163Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4415251Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4415359Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4415830Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4416357Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4416829Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4417297Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4417766Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4418230Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4418703Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4419177Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4419688Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4420166Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4420496Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.4420596Z Autotune Choices Stats: 2025-12-04T09:41:43.4421425Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4421523Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4421609Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4421721Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4422202Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4422674Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4423188Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4423666Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4424189Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4424670Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4425150Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4425680Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4426156Z triton_mm_1888 0.0286 ms 96.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4426629Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4426958Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.4427134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4427231Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4427380Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4427665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4429079Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4429167Z graph_break [] 2025-12-04T09:41:43.4429275Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4429449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4429543Z Autotune Choices Stats: 2025-12-04T09:41:43.4430392Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4430489Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.4430580Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4430687Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4431170Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4431649Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4432122Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4432642Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4433181Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4433656Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4434128Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4434595Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4435114Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4435589Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4435927Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.4436021Z Autotune Choices Stats: 2025-12-04T09:41:43.4436859Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4436958Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4437043Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4437155Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4437685Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4438198Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4438677Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4439142Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4439666Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.4440136Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4440612Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4441086Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4441557Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4442075Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4442448Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.4442630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4442727Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4442862Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4443115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4444499Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4444635Z graph_break [] 2025-12-04T09:41:43.4444746Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4444921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4445018Z Autotune Choices Stats: 2025-12-04T09:41:43.4445863Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4445959Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4446047Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4446151Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4446637Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4447119Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4447640Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4448115Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4448595Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4449079Z triton_mm_1954 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4449546Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4450024Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4450491Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4450961Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4451340Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.4451435Z Autotune Choices Stats: 2025-12-04T09:41:43.4452283Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4452419Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4452508Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4452613Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4453091Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4453566Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4454076Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4454552Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4455019Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4455488Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4455957Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4456433Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4456960Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4457457Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4457823Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.4457997Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.4458095Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4458230Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4458481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4459873Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4459956Z graph_break [] 2025-12-04T09:41:43.4460063Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4460248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4460342Z Autotune Choices Stats: 2025-12-04T09:41:43.4461221Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4461355Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4461445Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4461562Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4462044Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4462532Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.4463014Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4463535Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4464025Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4464496Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4464978Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4465462Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4465943Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4466452Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4466787Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.4466885Z Autotune Choices Stats: 2025-12-04T09:41:43.4467778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4467884Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4467974Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4468083Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4468568Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4469049Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4469522Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4469991Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4470534Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4471045Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4471519Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4472002Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4472480Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4472998Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4473336Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.4473515Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4473614Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4473748Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4473996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4475374Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4475467Z graph_break [] 2025-12-04T09:41:43.4475574Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4475751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4475885Z Autotune Choices Stats: 2025-12-04T09:41:43.4476722Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4476818Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4476914Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4477021Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4477505Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4477980Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4478450Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4478922Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4479391Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4479965Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4480478Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4480960Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.4481436Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4481917Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4482296Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.4482393Z Autotune Choices Stats: 2025-12-04T09:41:43.4483243Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4483341Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4483428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4483537Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4484023Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4484508Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4484983Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4485490Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4485964Z 
triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4486431Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4486906Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4487381Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4487909Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4488380Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4488709Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.4488946Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4489045Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4489181Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4489471Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4490416Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), 
('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4490507Z graph_break [] 2025-12-04T09:41:43.4490612Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4490789Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4490882Z Autotune Choices Stats: 2025-12-04T09:41:43.4491757Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4491858Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4491947Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4492051Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4492533Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4493007Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4493488Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4493964Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4494445Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4494967Z triton_mm_2097 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4495440Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4495913Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4496387Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4496862Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4497232Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.4497342Z Autotune Choices Stats: 2025-12-04T09:41:43.4498164Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4498298Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4498394Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4498498Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4499008Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4499481Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4499948Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4500562Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4501107Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4501582Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4502049Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4502524Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4503000Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4503486Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4503823Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.4503920Z Autotune Choices Stats: 
2025-12-04T09:41:43.4504837Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4504934Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4505023Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4505139Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4505619Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4506097Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4506571Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4507043Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4507523Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4508058Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4508590Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.4509072Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4509558Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4510031Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4510407Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.4510586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4510680Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4510812Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4511063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4512436Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4512528Z graph_break [] 2025-12-04T09:41:43.4512637Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4512815Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4512910Z Autotune Choices Stats: 2025-12-04T09:41:43.4513797Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4513895Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4513983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4514086Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4514563Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4515040Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4515517Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4515990Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4516465Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4516938Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4517504Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4518015Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4518489Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4518958Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4519290Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.4519424Z Autotune Choices Stats: 2025-12-04T09:41:43.4520365Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4520463Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4520563Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4520676Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4521164Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4521644Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4522125Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4522607Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4523127Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4523600Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4524072Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4524550Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4525033Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4525517Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4525862Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.4526036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4526139Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4526273Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4526568Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4527566Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 
4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4527694Z graph_break [] 2025-12-04T09:41:43.4527809Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4527990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4528084Z Autotune Choices Stats: 2025-12-04T09:41:43.4528928Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4529066Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4529158Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4529271Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4529755Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4530245Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4530722Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4531194Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4531689Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.4532174Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4532709Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4533183Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4533669Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4534144Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4534484Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.4534584Z Autotune Choices Stats: 2025-12-04T09:41:43.4535437Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4535542Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4535632Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4535743Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4536266Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.4536737Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4537305Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4537780Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4538260Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4538798Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4539268Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4539750Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4540225Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4540707Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4541044Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.4541141Z Autotune Choices Stats: 2025-12-04T09:41:43.4542028Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.4542126Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4542222Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4542337Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4542811Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4543287Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4543764Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4544246Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4544716Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4545192Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4545705Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4546185Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4546709Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4547183Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4547568Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.4547742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4547880Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4548019Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4548271Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4549664Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4549749Z graph_break [] 2025-12-04T09:41:43.4549856Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4550038Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4550135Z Autotune Choices Stats: 2025-12-04T09:41:43.4550975Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4551070Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4551159Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4551310Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4551791Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4552287Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4552769Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4553246Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4553728Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4554197Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4554669Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4555218Z 
triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4555736Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4556207Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4556541Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.4556642Z Autotune Choices Stats: 2025-12-04T09:41:43.4557483Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4557645Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4557741Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4557873Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4558362Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4558844Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4559318Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4559835Z triton_mm_2328 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4560320Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4560835Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4561310Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4561794Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4562272Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4562754Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4563090Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.4563270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4563379Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4563516Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4563772Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4565187Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4565314Z graph_break [] 2025-12-04T09:41:43.4565428Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4565614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4565715Z Autotune Choices Stats: 2025-12-04T09:41:43.4566555Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4566651Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4566786Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4566895Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4567385Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4567926Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4568409Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4568880Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4569360Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4569840Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4570358Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4570835Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4571311Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4571803Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4572141Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.4572244Z Autotune Choices Stats: 2025-12-04T09:41:43.4573092Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4573189Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4573280Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4573394Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.4573873Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4574397Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4574930Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4575402Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4575876Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4576343Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4576863Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4577399Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4577880Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4578354Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4578683Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.4578868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4578965Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4579108Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4579357Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4580777Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4580869Z graph_break [] 2025-12-04T09:41:43.4580974Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4581160Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4581255Z Autotune Choices Stats: 2025-12-04T09:41:43.4582087Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.4582192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4582281Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4582391Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4582868Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4583338Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4583854Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4584372Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4584855Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4585325Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4585813Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4586334Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4586812Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4587289Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.4587626Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.4587744Z Autotune Choices Stats: 2025-12-04T09:41:43.4588617Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4588720Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4588813Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4588918Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4589444Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4589919Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4590387Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4590866Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4591338Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4591814Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4592283Z 
triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4592768Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4593288Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4593799Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4594143Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.4594369Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.4594481Z Traceback (most recent call last): 2025-12-04T09:41:43.4594900Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.4595086Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.4595485Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.4595667Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.4595839Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.4595925Z Searched string: 2025-12-04T09:41:43.4596063Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.4596069Z 2025-12-04T09:41:43.4596199Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.4596204Z 2025-12-04T09:41:43.4596336Z a_k_idx_vals = offs_k[None, :] + 
(k_idx * BLOCK_K) 2025-12-04T09:41:43.4596464Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.4596468Z 2025-12-04T09:41:43.4596566Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.4596659Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.4596760Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4596856Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.4596860Z 2025-12-04T09:41:43.4596954Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.4597051Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.4597147Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4597242Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.4597246Z 2025-12-04T09:41:43.4597250Z 2025-12-04T09:41:43.4597417Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.4597422Z 2025-12-04T09:41:43.4597473Z 2025-12-04T09:41:43.4597624Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.4597774Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.4597894Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.4597985Z idx_m = rm[:, None] 2025-12-04T09:41:43.4598081Z idx_n = rn[None, :] 2025-12-04T09:41:43.4598178Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.4598183Z 2025-12-04T09:41:43.4598286Z # inductor generates a suffix 2025-12-04T09:41:43.4598388Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.4598603Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.4598699Z ''', device_str='cuda') 2025-12-04T09:41:43.4598707Z 2025-12-04T09:41:43.4598711Z 2025-12-04T09:41:43.4598812Z async_compile.wait(globals()) 2025-12-04T09:41:43.4598896Z del async_compile 2025-12-04T09:41:43.4598901Z 2025-12-04T09:41:43.4598986Z class Runner: 2025-12-04T09:41:43.4599090Z def __init__(self, partitions): 2025-12-04T09:41:43.4599195Z self.partitions = partitions 2025-12-04T09:41:43.4599204Z 2025-12-04T09:41:43.4599313Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.4599404Z new_callables = [] 
2025-12-04T09:41:43.4599579Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.4599684Z new_callables.append(fn(c)) 2025-12-04T09:41:43.4599786Z self.partitions = new_callables 2025-12-04T09:41:43.4599793Z 2025-12-04T09:41:43.4599889Z def call(self, args): 2025-12-04T09:41:43.4600025Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.4600108Z args.clear() 2025-12-04T09:41:43.4600362Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.4600541Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.4600651Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.4600748Z torch.cuda.set_device(0) 2025-12-04T09:41:43.4600918Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.4601147Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.4601243Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.4601433Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.4601521Z del arg0_1 2025-12-04T09:41:43.4601682Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.4602009Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.4602111Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.4602329Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:43.4602421Z del arg1_1 2025-12-04T09:41:43.4602500Z del buf0 2025-12-04T09:41:43.4602584Z return (buf1, ) 2025-12-04T09:41:43.4602588Z 2025-12-04T09:41:43.4602693Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.4602777Z call = runner.call 2025-12-04T09:41:43.4602938Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.4602942Z 2025-12-04T09:41:43.4602952Z 2025-12-04T09:41:43.4603091Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.4603221Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.4603373Z from 
torch._inductor.utils import print_performance 2025-12-04T09:41:43.4603576Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.4603778Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.4603881Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.4604046Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.4604051Z 2025-12-04T09:41:43.4604054Z 2025-12-04T09:41:43.4604149Z if __name__ == "__main__": 2025-12-04T09:41:43.4604410Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.4604570Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.4604658Z From CHECK: .to( 2025-12-04T09:41:43.4604662Z 2025-12-04T09:41:43.4604666Z 2025-12-04T09:41:43.4604839Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.4605396Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.4605404Z 2025-12-04T09:41:43.4605622Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.4605796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4605902Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4606032Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4606282Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4607714Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4607801Z graph_break [] 2025-12-04T09:41:43.4608006Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4608182Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4608328Z Autotune Choices Stats: 2025-12-04T09:41:43.4609184Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4609281Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4609377Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4609485Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4609971Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4610489Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4610957Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4611433Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4611892Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4612371Z triton_mm_29 0.0277 ms 
96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4612835Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4613296Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4613803Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4614271Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4614610Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.4614706Z Autotune Choices Stats: 2025-12-04T09:41:43.4615559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4615663Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4615751Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4615861Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4616343Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4616812Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4617320Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4617837Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4618348Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4618810Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4619271Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4619783Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4620253Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4620729Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4621061Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.4621241Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.4621335Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4621469Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4621722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4622674Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4622766Z graph_break [] 2025-12-04T09:41:43.4622871Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4623087Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4623185Z Autotune Choices Stats: 2025-12-04T09:41:43.4624015Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4624114Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4624200Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4624309Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4624784Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4625260Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4625736Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4626213Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4626734Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4627223Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4627733Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4628208Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4628681Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4629195Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4629534Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.4629629Z Autotune Choices Stats: 2025-12-04T09:41:43.4630475Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4630570Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4630668Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4630775Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4631243Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4631720Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4632195Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4632705Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4633171Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4633639Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4634117Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4634591Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4635064Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4635543Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4635878Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.4636007Z Autotune Choices Stats: 2025-12-04T09:41:43.4636858Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.4637014Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4637112Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4637250Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4637737Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4638202Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4638724Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4639200Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4639726Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4640197Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4640670Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4641159Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4641623Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4642156Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4642490Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.4642675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4646386Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4646543Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4646810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4647804Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4647891Z graph_break [] 
2025-12-04T09:41:43.4647992Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4648172Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4648266Z Autotune Choices Stats: 2025-12-04T09:41:43.4649176Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4649279Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4649372Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4649518Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4650005Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4650474Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4650934Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4651410Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4651930Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4652414Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:43.4652879Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4653350Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4653819Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4654282Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4654616Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.4654751Z Autotune Choices Stats: 2025-12-04T09:41:43.4655588Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.4655681Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4655770Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4655882Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4656354Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4656827Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4657294Z 
triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4657755Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4658218Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4658721Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4659226Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4659698Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4660171Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4660641Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4661013Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.4661108Z Autotune Choices Stats: 2025-12-04T09:41:43.4661955Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4662050Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4662140Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4662249Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4662739Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4663216Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4663695Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4664199Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4664667Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4665131Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4665599Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4666068Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.4666529Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4667001Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4667331Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.4667527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4667681Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4667823Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4668115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4669490Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4669578Z graph_break [] 2025-12-04T09:41:43.4669682Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4669855Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4669989Z Autotune Choices Stats: 2025-12-04T09:41:43.4670832Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4670927Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4671015Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4671120Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4671600Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4672065Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4672526Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4672989Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4673489Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4673959Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4674425Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4674901Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4675370Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4675833Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4676165Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.4676256Z Autotune Choices Stats: 2025-12-04T09:41:43.4677156Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4677271Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4677400Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4677507Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4677983Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4678457Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4678928Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4679458Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4680002Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4680479Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4680948Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4681425Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4681906Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4682378Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4682752Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.4682930Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4683022Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4683158Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4683403Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4684345Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4684436Z graph_break [] 
2025-12-04T09:41:43.4684543Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4684718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4684807Z Autotune Choices Stats: 2025-12-04T09:41:43.4685639Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.4685734Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4685819Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4685923Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4686454Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4686926Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4687443Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4687911Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4688379Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4688885Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.4689353Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4689828Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4690298Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4690774Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4691106Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.4691198Z Autotune Choices Stats: 2025-12-04T09:41:43.4692063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4692156Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4692243Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4692347Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4692819Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4693297Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.4693768Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4694243Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4694709Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4695178Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4695683Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4696157Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4696681Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4697153Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4697491Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.4697599Z Autotune Choices Stats: 2025-12-04T09:41:43.4698499Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4698593Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4698676Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4698792Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4699273Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4699756Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4700222Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4700855Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4701459Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4701928Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4702398Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4702877Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4703350Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4703825Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4704153Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.4704328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4704420Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4704553Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4704800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4705804Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4705940Z graph_break [] 2025-12-04T09:41:43.4706043Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4706218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4706311Z Autotune Choices Stats: 2025-12-04T09:41:43.4707135Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4707285Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.4707369Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4707476Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4708006Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4708484Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4708958Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4709434Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4709919Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4710395Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4710909Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4711391Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4711857Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.4712336Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4712667Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.4712758Z Autotune Choices Stats: 2025-12-04T09:41:43.4713591Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4713682Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4713769Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4713871Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4714378Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4714859Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4715397Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4715872Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4716340Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:43.4716850Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4717316Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4717792Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4718269Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4718738Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4719069Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.4719162Z Autotune Choices Stats: 2025-12-04T09:41:43.4720036Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4720179Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4720265Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4720375Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4720843Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4721317Z 
triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4721793Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4722272Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4722752Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4723227Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4723749Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4724218Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4724729Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4725198Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4725528Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 
2025-12-04T09:41:43.4725702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4725834Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4725967Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4726215Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4727652Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4727741Z graph_break [] 2025-12-04T09:41:43.4727843Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4728017Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4728110Z Autotune Choices Stats: 2025-12-04T09:41:43.4728943Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4729039Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4729123Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4729226Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4729738Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4730209Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4730688Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4731166Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4731648Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4732124Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4732600Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4733125Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4733606Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4734121Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4734448Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.4734542Z Autotune Choices Stats: 2025-12-04T09:41:43.4735379Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4735512Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4735601Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4735703Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4736181Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4736659Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4737127Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4737621Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4738118Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4738588Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4739094Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4739566Z triton_mm_1217 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4740044Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4740517Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4740845Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.4741020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4741112Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4741244Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4741487Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4742902Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4743023Z graph_break [] 2025-12-04T09:41:43.4743127Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4743302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4743395Z Autotune Choices Stats: 2025-12-04T09:41:43.4744238Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4744330Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4744415Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4744561Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4745046Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4745525Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4745993Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4746461Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4746928Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4747396Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4747870Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4748377Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4748851Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4749323Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4749653Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.4749748Z Autotune Choices Stats: 2025-12-04T09:41:43.4750584Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4750678Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4750763Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4750865Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4751344Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4751882Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4752394Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4752868Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:43.4753338Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4753803Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4754311Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4754788Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4755261Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4755734Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4756058Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.4756236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4756331Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4756459Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4756707Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4758168Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 
13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4758254Z graph_break [] 2025-12-04T09:41:43.4758355Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4758533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4758628Z Autotune Choices Stats: 2025-12-04T09:41:43.4759460Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.4759609Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4759700Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4759803Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4760278Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4760751Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4761264Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4761780Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4762252Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4762728Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4763200Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4763710Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4764178Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4764644Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4764973Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.4765063Z Autotune Choices Stats: 2025-12-04T09:41:43.4765901Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4765998Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4766081Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4766186Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4766701Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4767178Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4767699Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4768175Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4768647Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4769114Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4769581Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4770052Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4770566Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4771074Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.4771400Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.4771574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4771667Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4771799Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4772042Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4773458Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4773546Z graph_break [] 2025-12-04T09:41:43.4773652Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4773830Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4773918Z Autotune Choices Stats: 2025-12-04T09:41:43.4774744Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4774841Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4774927Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4775032Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4775504Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.4776018Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4776494Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4776973Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4777459Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4777937Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4778424Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4778895Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4779369Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4779878Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4780244Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.4780338Z Autotune Choices Stats: 2025-12-04T09:41:43.4781166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4781257Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4781344Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4781446Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4781963Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4782430Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4782902Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4783376Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4783842Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4784315Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4784784Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4785326Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4785802Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4786276Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4786622Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.4786796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4786898Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4787033Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4787293Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4788725Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4788813Z graph_break [] 2025-12-04T09:41:43.4788925Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4789146Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4789241Z Autotune Choices Stats: 2025-12-04T09:41:43.4790148Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4790242Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4790331Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4790438Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4790918Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4791438Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4791907Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4792385Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4792852Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4793319Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4793800Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4794274Z 
triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4794789Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4795266Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4795596Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.4795690Z Autotune Choices Stats: 2025-12-04T09:41:43.4796533Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4796632Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4796720Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4796828Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4797306Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4797830Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4798346Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4798816Z triton_mm_1382 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4799323Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4799836Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4800428Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4800977Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4801451Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4801931Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4802262Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.4802441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4802534Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4802664Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4802918Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4803853Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4803942Z graph_break [] 2025-12-04T09:41:43.4804044Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4804274Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4804371Z Autotune Choices Stats: 2025-12-04T09:41:43.4805202Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4805302Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4805387Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4805494Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4805977Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4806454Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4806924Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4807397Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4807929Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4808464Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4808948Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4809430Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4809912Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4810431Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4810761Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.4810853Z Autotune Choices Stats: 2025-12-04T09:41:43.4811689Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4811781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4811875Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4811977Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4812450Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4812927Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4813438Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4813912Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4814379Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4814848Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4815317Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4815795Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4816275Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4816748Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4817121Z SingleProcess 
AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.4817215Z Autotune Choices Stats: 2025-12-04T09:41:43.4818114Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4818257Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4818342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4818451Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4818938Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4819412Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4819970Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4820447Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4820930Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4821406Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4821885Z triton_mm_1461 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4822368Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4822889Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4823374Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4823703Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.4823879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4823974Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4824106Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4824360Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4825306Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4825394Z graph_break [] 2025-12-04T09:41:43.4825500Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4825673Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4825769Z Autotune Choices Stats: 2025-12-04T09:41:43.4826645Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4826742Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4826869Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4826973Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4827477Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4827980Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4828458Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4828993Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4829464Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4829946Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4830413Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4830884Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4831362Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4831833Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4832209Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.4832305Z Autotune Choices Stats: 2025-12-04T09:41:43.4833157Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4833251Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4833338Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4833448Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4833921Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4834416Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4834888Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4835354Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4835871Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4836343Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4836856Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4837337Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4837817Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4838337Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4838665Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.4838766Z Autotune Choices Stats: 2025-12-04T09:41:43.4839651Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4839750Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4839838Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.4839953Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4840436Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4840919Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4841403Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4841926Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4842413Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4842897Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4843384Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4843873Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4844341Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4844822Z 
triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4845156Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.4845371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4845475Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4845649Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4845904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4846841Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4846925Z graph_break [] 2025-12-04T09:41:43.4847036Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4847212Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4847415Z Autotune Choices Stats: 2025-12-04T09:41:43.4848300Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4848397Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4848490Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4848598Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4849086Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.4849561Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4850043Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4850529Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4851057Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4851539Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4852008Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4852492Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4852964Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4853436Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4853776Z SingleProcess AUTOTUNE benchmarking takes 0.2024 
seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.4853872Z Autotune Choices Stats: 2025-12-04T09:41:43.4854747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4854847Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4854936Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4855116Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4855589Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4856072Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4856546Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4857015Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4857554Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4858058Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4858530Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4859004Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4859488Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4859964Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4860294Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.4860435Z Autotune Choices Stats: 2025-12-04T09:41:43.4861272Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4861372Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4861460Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4861573Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4862060Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4862538Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4863017Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4863487Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4863958Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4864476Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4864994Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4865482Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4865958Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4866446Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4866820Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.4866999Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4867100Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4867233Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4867488Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4868872Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4868963Z graph_break [] 2025-12-04T09:41:43.4869073Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4869249Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4869349Z Autotune Choices Stats: 2025-12-04T09:41:43.4870233Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4870333Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4870428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4870534Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4871018Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4871499Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4871984Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4872467Z 
triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4872937Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4873417Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4873935Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4874446Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4874922Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4875396Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4875734Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.4875868Z Autotune Choices Stats: 2025-12-04T09:41:43.4876712Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4876813Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.4876904Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4877017Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4877540Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4878020Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4878502Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4878976Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4879542Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4880015Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4880489Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4880971Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4881453Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4881931Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4882262Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.4882442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4882538Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4882679Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4882968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4884355Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4884485Z graph_break [] 2025-12-04T09:41:43.4884591Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4884775Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4884868Z Autotune Choices Stats: 2025-12-04T09:41:43.4885724Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.4885867Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4885954Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4886060Z 
dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4886550Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4887024Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4887552Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4888028Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4888505Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4889020Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4889496Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4889975Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4890448Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4890927Z triton_mm_1706 0.0286 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4891261Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.4891359Z Autotune Choices Stats: 2025-12-04T09:41:43.4892190Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.4892286Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4892443Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4892553Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4893032Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4893547Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4894015Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4894497Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4895012Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4895485Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4895955Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4896437Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4896911Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4897396Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4897730Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.4897904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4898051Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4898186Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4898434Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4899823Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4899917Z graph_break [] 
2025-12-04T09:41:43.4900025Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4900201Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4900443Z Autotune Choices Stats: 2025-12-04T09:41:43.4901309Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4901406Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4901497Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4901604Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4902159Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4902646Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4903185Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4903669Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4904155Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4904687Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.4905173Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4905646Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4906124Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4906594Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4906937Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.4907030Z Autotune Choices Stats: 2025-12-04T09:41:43.4907981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4908082Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4908171Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4908276Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4908763Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4909239Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.4909712Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4910187Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4910661Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4911139Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4911651Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4912136Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4912663Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4913154Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4913485Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.4913666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4913803Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.4913940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4914197Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4915588Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4915677Z graph_break [] 2025-12-04T09:41:43.4915785Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4915960Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4916059Z Autotune Choices Stats: 2025-12-04T09:41:43.4916894Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4916996Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4917085Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4917230Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4917766Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4918249Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4918735Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4919213Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4919778Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4920269Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4920740Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4921262Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4921769Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4922245Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4922579Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.4922671Z Autotune Choices Stats: 2025-12-04T09:41:43.4923514Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4923653Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4923746Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4923855Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4927951Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4928459Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4928933Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4929412Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4929884Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4930452Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4930928Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4931404Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4931885Z 
triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4932357Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4932698Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.4932875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4932971Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4933107Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4933352Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4934335Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4934425Z graph_break [] 2025-12-04T09:41:43.4934575Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4934754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4934846Z Autotune Choices Stats: 2025-12-04T09:41:43.4935693Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.4935791Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4935878Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4935988Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.4936519Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4937017Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4937519Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4937995Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4938469Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4938950Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4939422Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4939935Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4940408Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4940889Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4941224Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.4941319Z Autotune Choices Stats: 2025-12-04T09:41:43.4942160Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.4942253Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4942345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4942453Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4942932Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4943451Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4943924Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4944441Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4944908Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4945378Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4945882Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4946358Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4946834Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4947307Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4947664Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.4947776Z Autotune Choices Stats: 2025-12-04T09:41:43.4948618Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4948713Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4948798Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4948912Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.4949426Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4949908Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4950381Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4950858Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4951343Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4951819Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4952301Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4952820Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4953304Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4953831Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4954160Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.4954338Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:41:43.4954431Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4954565Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4954812Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4956230Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4956322Z graph_break [] 2025-12-04T09:41:43.4956425Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4956605Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4956696Z Autotune Choices Stats: 2025-12-04T09:41:43.4957545Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.4957644Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4957731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4957840Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4958323Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4958843Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4959330Z 
triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4959873Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4960353Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4960826Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4961302Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4961772Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4962288Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4962767Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4963140Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.4963238Z Autotune Choices Stats: 2025-12-04T09:41:43.4964082Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4964174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4964262Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4964432Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4964919Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4965395Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4965871Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4966353Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4966818Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4967303Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4967811Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4968330Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4968806Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4969281Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4969620Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.4969798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4969893Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4970022Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4970271Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4971649Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4971776Z graph_break [] 2025-12-04T09:41:43.4971881Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4972055Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4972186Z Autotune Choices Stats: 2025-12-04T09:41:43.4973048Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4973141Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4973230Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4973333Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4973809Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4974331Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4974807Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4975289Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4975769Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.4976238Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4976717Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4977188Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4977752Z triton_mm_1958 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4978220Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4978557Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.4978649Z Autotune Choices Stats: 2025-12-04T09:41:43.4979494Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4979593Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4979681Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4979786Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4980263Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4980733Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4981249Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4981752Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4982224Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4982690Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4983159Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4983675Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4984151Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4984629Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4984957Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.4985136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.4985229Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.4985361Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.4985614Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.4987036Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.4987127Z graph_break [] 2025-12-04T09:41:43.4987233Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.4987407Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.4987505Z Autotune Choices Stats: 2025-12-04T09:41:43.4988334Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4988436Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4988520Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4988625Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.4989105Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4989592Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4990070Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.4990591Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4991113Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.4991590Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4992057Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4992536Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4993057Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4993532Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4993862Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.4993953Z Autotune Choices Stats: 2025-12-04T09:41:43.4994797Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.4994891Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.4994983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.4995087Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.4995569Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.4996090Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.4996559Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.4997030Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4997556Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.4998024Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4998506Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.4998980Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.4999456Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5000023Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5000664Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.5000838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5000933Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5001067Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5001312Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5002706Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5002878Z graph_break [] 2025-12-04T09:41:43.5002980Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5003156Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5003247Z Autotune Choices Stats: 2025-12-04T09:41:43.5004101Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5004195Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5004279Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5004386Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5004868Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5005345Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5005875Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5006344Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5006817Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5007299Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5007829Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5008306Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5008787Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5009271Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5009665Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.5009765Z Autotune Choices Stats: 2025-12-04T09:41:43.5010672Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5010769Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5010855Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5010958Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5011444Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5011964Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5012443Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5012915Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5013381Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5013853Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5014325Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5014802Z 
triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5015319Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5015797Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5016127Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.5016299Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5016398Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5016529Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5016777Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5017721Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5017802Z graph_break [] 2025-12-04T09:41:43.5017910Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5018082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5018172Z Autotune Choices Stats: 2025-12-04T09:41:43.5019043Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5019175Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5019264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5019368Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5019846Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5020327Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5020802Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5021320Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5021806Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5022296Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5022771Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5023241Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5023723Z triton_mm_2084 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5024196Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5024571Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.5024666Z Autotune Choices Stats: 2025-12-04T09:41:43.5025515Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5025610Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5025694Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5025801Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5026272Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5026745Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5027218Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5027740Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5028252Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5028720Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5029234Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5029707Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5030177Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5030691Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5031021Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.5031115Z Autotune Choices Stats: 2025-12-04T09:41:43.5031937Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5032032Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5032116Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5032226Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5032706Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5033183Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5033701Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5034174Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5034651Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5035136Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5035618Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5036104Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5036586Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5037066Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.5037468Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.5037672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5037828Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5037958Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5038202Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5039647Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5039782Z graph_break [] 2025-12-04T09:41:43.5039889Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5040063Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5040153Z Autotune Choices Stats: 2025-12-04T09:41:43.5040994Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5041086Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5041176Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5041280Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5041756Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.5042237Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5042708Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5043233Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5043708Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5044183Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5044672Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5045140Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5045622Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5046088Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5046423Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.5046513Z Autotune Choices Stats: 2025-12-04T09:41:43.5047777Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5047911Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5047997Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5048102Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5048585Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5049057Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5049532Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5050048Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5050523Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5050989Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5051459Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5051931Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5052406Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5052926Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5053253Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.5053428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5053521Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5053650Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5053899Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5054842Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5054929Z graph_break [] 2025-12-04T09:41:43.5055032Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5055206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5055300Z Autotune Choices Stats: 2025-12-04T09:41:43.5056133Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5056229Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5056313Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5056457Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5056964Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5057506Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5057981Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5058454Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5058983Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5059464Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5059952Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5060425Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.5060902Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5061376Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5061708Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.5061798Z Autotune Choices Stats: 2025-12-04T09:41:43.5062684Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5062778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5062868Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5062971Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5063444Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5063919Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5064386Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5064861Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.5065327Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5065834Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5066302Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5066814Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5067289Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5067806Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5068189Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.5068284Z Autotune Choices Stats: 2025-12-04T09:41:43.5069114Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.5069220Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5069307Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5069427Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5069903Z 
triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5070369Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5070853Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5071322Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5071862Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5072329Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5072798Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5073273Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5073747Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5074222Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5074549Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.5074730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5074826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5074960Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5075251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5076674Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5076766Z graph_break [] 2025-12-04T09:41:43.5076868Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5077041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5077139Z Autotune Choices Stats: 2025-12-04T09:41:43.5078010Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5078111Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5078198Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5078301Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5078783Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5079268Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5079795Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5080270Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5080742Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5081261Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5081728Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5082200Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5082672Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5083149Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5083481Z 
SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.5083573Z Autotune Choices Stats: 2025-12-04T09:41:43.5084415Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5084512Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5084644Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5084750Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5085272Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5085755Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5086224Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5086694Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5087207Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5087732Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5088203Z triton_mm_2332 0.0328 ms 84.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5088677Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5089151Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5089632Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5089971Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.5090186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5090282Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5090417Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5090664Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5092039Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5092127Z graph_break [] 2025-12-04T09:41:43.5092232Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5092410Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.5092500Z Autotune Choices Stats: 2025-12-04T09:41:43.5093339Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5093434Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5093520Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5093631Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5094163Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5094686Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5095168Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5095639Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5096118Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5096633Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5097116Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5097637Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5098119Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5098598Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5098935Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.5099039Z Autotune Choices Stats: 2025-12-04T09:41:43.5099912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5100011Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5100098Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5100205Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5100834Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5101312Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5101787Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.5102264Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5102731Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5103202Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5103747Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5104284Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5104761Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5105242Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5105577Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.5105833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5105935Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5106066Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5106323Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 
1)] 2025-12-04T09:41:43.5107753Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5107836Z graph_break [] 2025-12-04T09:41:43.5107947Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5108122Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5108228Z Autotune Choices Stats: 2025-12-04T09:41:43.5109059Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.5109155Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5109250Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5109414Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5109900Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5110382Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5110865Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5111360Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5111836Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5112318Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5112798Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5113325Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5113835Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5114309Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5114649Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.5114741Z Autotune Choices Stats: 2025-12-04T09:41:43.5115582Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5115716Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5115805Z strides: [256, 
1], [256, 1] 2025-12-04T09:41:43.5115912Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5116397Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5116880Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5117347Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5117823Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5118299Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5118809Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5119287Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5119810Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5120297Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5120779Z 
triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5121115Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.5121294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5121389Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5121526Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5121774Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5122758Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5122887Z graph_break [] 2025-12-04T09:41:43.5122991Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5123165Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5123260Z Autotune Choices Stats: 2025-12-04T09:41:43.5124092Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5124195Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5124282Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5124389Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5124914Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.5125390Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5125875Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5126346Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5126824Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5127300Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5127827Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5128341Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5128816Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5129287Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5129624Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds 
and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:43.5129717Z Autotune Choices Stats: 2025-12-04T09:41:43.5130550Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5130645Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5130736Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5130839Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5131312Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5131832Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5132300Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5132821Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5133286Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5133764Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5134277Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5134755Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5135241Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5135712Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5136048Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.5136149Z Autotune Choices Stats: 2025-12-04T09:41:43.5136982Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5137077Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5137161Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5137282Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5137802Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5138274Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5138752Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5139238Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5139730Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5140198Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5140676Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5141191Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5141674Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5142222Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5142550Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.5142732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5142828Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5142960Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5143255Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5144198Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5144293Z graph_break [] 2025-12-04T09:41:43.5144400Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5144575Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5144672Z Autotune Choices Stats: 2025-12-04T09:41:43.5145515Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.5145613Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5145699Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5145806Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5146298Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5146811Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5147298Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5147832Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5148312Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5148796Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5149268Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5149742Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5150210Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5150798Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5151168Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.5151266Z Autotune Choices Stats: 2025-12-04T09:41:43.5152098Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5152192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5152285Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5152390Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.5152867Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5153391Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5153871Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5154351Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5154826Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5155303Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5155776Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5156293Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5156773Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5157247Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5157611Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.5157712Z Autotune Choices Stats: 2025-12-04T09:41:43.5158563Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5158668Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5158755Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5158872Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5159346Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5159890Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5160417Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5160935Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5161430Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5161905Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5162392Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5162910Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5163390Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5163872Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5164201Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.5164379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5164477Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5164612Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5164870Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5166283Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5166379Z graph_break [] 2025-12-04T09:41:43.5166482Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5166655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5166755Z Autotune Choices Stats: 2025-12-04T09:41:43.5167591Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.5167697Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5167784Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5167889Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5168381Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5168860Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5169379Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5169864Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5170384Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5170870Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.5171340Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5171863Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5172332Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5172813Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5173149Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.5173242Z Autotune Choices Stats: 2025-12-04T09:41:43.5174081Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5174183Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5174278Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5174387Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5174862Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5175380Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5175860Z 
triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5176334Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5176810Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5177294Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5177810Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5178288Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5178836Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5179316Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5179700Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.5179932Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.5180036Z Traceback (most recent call last): 2025-12-04T09:41:43.5180458Z File 
"/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.5180642Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.5180997Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.5181225Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.5181394Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.5181485Z Searched string: 2025-12-04T09:41:43.5181621Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.5181626Z 2025-12-04T09:41:43.5181747Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.5181760Z 2025-12-04T09:41:43.5181890Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.5182015Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.5182020Z 2025-12-04T09:41:43.5182124Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.5182216Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.5182307Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.5182407Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.5182412Z 2025-12-04T09:41:43.5182501Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.5182595Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.5182690Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.5182781Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.5182785Z 2025-12-04T09:41:43.5182789Z 2025-12-04T09:41:43.5182955Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.5182960Z 2025-12-04T09:41:43.5182964Z 2025-12-04T09:41:43.5183083Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.5183201Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.5183364Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.5183451Z idx_m = rm[:, None] 2025-12-04T09:41:43.5183544Z idx_n = rn[None, :] 2025-12-04T09:41:43.5183637Z mask = 
(idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.5183641Z 2025-12-04T09:41:43.5183742Z # inductor generates a suffix 2025-12-04T09:41:43.5183839Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.5184058Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.5184148Z ''', device_str='cuda') 2025-12-04T09:41:43.5184152Z 2025-12-04T09:41:43.5184158Z 2025-12-04T09:41:43.5184263Z async_compile.wait(globals()) 2025-12-04T09:41:43.5184346Z del async_compile 2025-12-04T09:41:43.5184354Z 2025-12-04T09:41:43.5184437Z class Runner: 2025-12-04T09:41:43.5184541Z def __init__(self, partitions): 2025-12-04T09:41:43.5184644Z self.partitions = partitions 2025-12-04T09:41:43.5184648Z 2025-12-04T09:41:43.5184775Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.5184866Z new_callables = [] 2025-12-04T09:41:43.5184982Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.5185094Z new_callables.append(fn(c)) 2025-12-04T09:41:43.5185199Z self.partitions = new_callables 2025-12-04T09:41:43.5185203Z 2025-12-04T09:41:43.5185302Z def call(self, args): 2025-12-04T09:41:43.5185392Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.5185473Z args.clear() 2025-12-04T09:41:43.5185618Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.5185791Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.5185901Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.5186041Z torch.cuda.set_device(0) 2025-12-04T09:41:43.5186213Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.5186433Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.5186540Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.5186733Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.5186823Z del arg0_1 2025-12-04T09:41:43.5186986Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.5187240Z # Topologically Sorted 
Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.5187388Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.5187611Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.5187693Z del arg1_1 2025-12-04T09:41:43.5187796Z del buf0 2025-12-04T09:41:43.5187896Z return (buf1, ) 2025-12-04T09:41:43.5187900Z 2025-12-04T09:41:43.5188033Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.5188119Z call = runner.call 2025-12-04T09:41:43.5188283Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.5188290Z 2025-12-04T09:41:43.5188294Z 2025-12-04T09:41:43.5188444Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.5188579Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.5188731Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.5188937Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.5189141Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.5189247Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.5189415Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.5189419Z 2025-12-04T09:41:43.5189426Z 2025-12-04T09:41:43.5189514Z if __name__ == "__main__": 2025-12-04T09:41:43.5189723Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.5189881Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.5190009Z From CHECK: .to( 2025-12-04T09:41:43.5190019Z 2025-12-04T09:41:43.5190023Z 2025-12-04T09:41:43.5194341Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.5194927Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.5194933Z 2025-12-04T09:41:43.5195157Z This message 
can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.5195345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5195451Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5195583Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5195835Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5197224Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5197310Z graph_break [] 2025-12-04T09:41:43.5197418Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5197591Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5197687Z Autotune Choices Stats: 2025-12-04T09:41:43.5198606Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5198742Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5198830Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5198942Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5199427Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5199959Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5200621Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5201089Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5201551Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5202024Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5202487Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5202951Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5203414Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5203955Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5204293Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.5204397Z Autotune Choices Stats: 
2025-12-04T09:41:43.5205227Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5205329Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5205420Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5205526Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5206010Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5206471Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5206950Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5207507Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5207971Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5208492Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5208953Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5209425Z 
triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5209949Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5210417Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5210756Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.5210932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5211033Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5211167Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5211418Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5212361Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5212450Z graph_break [] 2025-12-04T09:41:43.5212557Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5212732Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5212829Z Autotune Choices Stats: 2025-12-04T09:41:43.5213702Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5213800Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5213892Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5213999Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5214476Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5214954Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5215433Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5215914Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5216393Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5216922Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5217428Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5217955Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5218425Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5218889Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5219265Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.5219359Z Autotune Choices Stats: 2025-12-04T09:41:43.5220215Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5220312Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5220398Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5220504Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5220974Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5221443Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5221912Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5222425Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5222890Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5223351Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5223821Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5224288Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5224765Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5225235Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5225566Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.5225662Z Autotune Choices Stats: 2025-12-04T09:41:43.5226535Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.5226670Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5226758Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5226872Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5227356Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5227820Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5228288Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5228824Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5229299Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5229777Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5230246Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5230719Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5231185Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5231693Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5232023Z SingleProcess AUTOTUNE benchmarking takes 
0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.5232198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5232294Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5232426Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5232675Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5233620Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5233710Z graph_break [] 2025-12-04T09:41:43.5233819Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5233994Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5234089Z Autotune Choices Stats: 2025-12-04T09:41:43.5234925Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.5235022Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5235116Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5235263Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5235747Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5236258Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.5236721Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5237188Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5237704Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5238183Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5238653Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5239123Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5239641Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5240108Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5240446Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.5240542Z Autotune Choices Stats: 2025-12-04T09:41:43.5241408Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.5241506Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5241593Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5241703Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5242174Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5242645Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5243120Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5243585Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5244060Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5244524Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5245031Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5245541Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5246011Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5246484Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5246812Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.5246951Z Autotune Choices Stats: 2025-12-04T09:41:43.5247838Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5247938Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5248028Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5248140Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5248616Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5249088Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5249569Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5250038Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.5250537Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5251004Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5251466Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5251937Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5252400Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5252869Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5253203Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.5253377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5253474Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5253611Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5253895Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5255273Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), 
('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5255394Z graph_break [] 2025-12-04T09:41:43.5255501Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5255675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5255766Z Autotune Choices Stats: 2025-12-04T09:41:43.5256625Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5256762Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5256852Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5256959Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5257442Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5257931Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5258426Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5258900Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5259366Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5259878Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5260350Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5260821Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5261300Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5261766Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5262105Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.5262198Z Autotune Choices Stats: 2025-12-04T09:41:43.5263034Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5263132Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5263221Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5263369Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5263851Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5264386Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5264868Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5265334Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5265845Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5266307Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5266778Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5267253Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5267727Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5268207Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5268538Z SingleProcess 
AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.5268715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5268811Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5268983Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5269235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5270177Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5270266Z graph_break [] 2025-12-04T09:41:43.5270371Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5270544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5270642Z Autotune Choices Stats: 2025-12-04T09:41:43.5271483Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.5271582Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5271670Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5271777Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5272261Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5272782Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.5273293Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5273765Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5274233Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5274704Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5275210Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5275688Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5276163Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5276639Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5276970Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.5277066Z Autotune Choices Stats: 2025-12-04T09:41:43.5277954Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5278050Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5278140Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5278282Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5278760Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5279237Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5279772Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5280244Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5280716Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5281183Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5281652Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5282168Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5282683Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5283158Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5283490Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.5283583Z Autotune Choices Stats: 2025-12-04T09:41:43.5284424Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5284561Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5284652Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5284766Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5285250Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5285726Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5286194Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5286668Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5287145Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5287682Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5288175Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5288648Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5289124Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5289600Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5289935Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.5290111Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5290206Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5290337Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5290590Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5291567Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 
4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5291692Z graph_break [] 2025-12-04T09:41:43.5291797Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5291971Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5292065Z Autotune Choices Stats: 2025-12-04T09:41:43.5292895Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5292988Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5293079Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5293186Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5293707Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5294180Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5294658Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5295138Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5295614Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.5296097Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5296571Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5297147Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5297639Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5298111Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5298446Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.5298540Z Autotune Choices Stats: 2025-12-04T09:41:43.5299398Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5299490Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5299575Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5299683Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5300156Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.5300884Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5301362Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5301898Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5302371Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5302839Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5303365Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5303839Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5304320Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5304790Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5305123Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds 
precompiling for 13 choices 2025-12-04T09:41:43.5305228Z Autotune Choices Stats: 2025-12-04T09:41:43.5306060Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5306160Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5306247Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5306359Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5306890Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5307364Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5307893Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5308374Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5308858Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5309342Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5309827Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5310339Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5310816Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5311326Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5311652Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.5311829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5311922Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5312053Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5312345Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5313733Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5313822Z graph_break [] 2025-12-04T09:41:43.5313927Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5314099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5314191Z Autotune Choices Stats: 
2025-12-04T09:41:43.5315026Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5315124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5315211Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5315319Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5315838Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5316313Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5316788Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5317271Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5317779Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5318290Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5318771Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.5319257Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5319826Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5320304Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5320677Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.5320770Z Autotune Choices Stats: 2025-12-04T09:41:43.5321608Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5321703Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5321831Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5321938Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5322423Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5322902Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5323375Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5323855Z 
triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5324328Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5324798Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5325310Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5325786Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5326263Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5326738Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5327072Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.5327251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5327349Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5327485Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5327733Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5329110Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5329237Z graph_break [] 2025-12-04T09:41:43.5329342Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5329557Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5329649Z Autotune Choices Stats: 2025-12-04T09:41:43.5330502Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5330598Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5330685Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5330791Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5331277Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5331803Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5332288Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5332756Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5333230Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5333702Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5334194Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5334730Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5335206Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5335687Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5336022Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.5336126Z Autotune Choices Stats: 2025-12-04T09:41:43.5336973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5337080Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5337170Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5337278Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5337819Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5338296Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5338814Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5339331Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5339809Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5340286Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5340757Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5341281Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5341759Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5342245Z triton_mm_1260 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5342579Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.5342754Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5342859Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5342994Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5343246Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5344676Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5344765Z graph_break [] 2025-12-04T09:41:43.5344877Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5345058Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5345153Z Autotune Choices Stats: 2025-12-04T09:41:43.5345993Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.5346095Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5346186Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5346292Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5346771Z triton_mm_1266 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5347254Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5347735Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5348314Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5348825Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5349318Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5349801Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5350271Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5350789Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5351860Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5352773Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.5353309Z Autotune Choices Stats: 2025-12-04T09:41:43.5354308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5355363Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5355625Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5355891Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5356575Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5357680Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5358729Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5359867Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5360916Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5361957Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.5362999Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5364066Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5365109Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5366212Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5367164Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.5367818Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5368204Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5368515Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5369000Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5370732Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5372320Z graph_break [] 2025-12-04T09:41:43.5372554Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5372938Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5373310Z Autotune Choices Stats: 2025-12-04T09:41:43.5374301Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5375337Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5375603Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5375870Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5376550Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5377637Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5378768Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5379832Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5380901Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5381992Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5383078Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5384152Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5385207Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5386249Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5387204Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.5387763Z Autotune Choices Stats: 2025-12-04T09:41:43.5388750Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.5389774Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5390038Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5390306Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5390986Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5392107Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5393148Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5394197Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5395244Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5396280Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5397318Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5398415Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5399554Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5400764Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5401673Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.5402294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5402680Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5402999Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5403485Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5405246Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5406822Z graph_break [] 2025-12-04T09:41:43.5407066Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5407447Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5407881Z Autotune Choices Stats: 2025-12-04T09:41:43.5408967Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5410067Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5410325Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5410600Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5411288Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5412367Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5413484Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.5414540Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5415590Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5416637Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5417695Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5418753Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5419808Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5420917Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5421843Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.5422359Z Autotune Choices Stats: 2025-12-04T09:41:43.5423359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.5424399Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5424666Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5424933Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5425618Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5426713Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5428036Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5429383Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5430418Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5431506Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5432555Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5433601Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5434688Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5435758Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5436740Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.5437491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5437964Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5438350Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5438891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5440259Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5441397Z graph_break [] 2025-12-04T09:41:43.5441626Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5442005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5442377Z Autotune Choices Stats: 2025-12-04T09:41:43.5443415Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.5444632Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5444909Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5445192Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5445982Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5447228Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5448469Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5449702Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5450934Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5452227Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5453315Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5454371Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5455437Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5456500Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.5457463Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.5458052Z Autotune Choices Stats: 2025-12-04T09:41:43.5459046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5460068Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5460334Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5460606Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5461286Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5462353Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5463403Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5464499Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5465550Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5466589Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.5467631Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5468736Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5469802Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5470859Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5471767Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.5472287Z Autotune Choices Stats: 2025-12-04T09:41:43.5477923Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5479065Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5479323Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5479669Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5480371Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5481448Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5482503Z 
triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5483803Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5485059Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5486310Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5487561Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5488814Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5490072Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5491387Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5492475Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.5493160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5493538Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5493846Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5494329Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5495624Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5496749Z graph_break [] 2025-12-04T09:41:43.5496980Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5497361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5497735Z Autotune Choices Stats: 2025-12-04T09:41:43.5498779Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5499808Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5500111Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5500617Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5501297Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5502443Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5503521Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5504592Z triton_mm_1495 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5505724Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5506773Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5507871Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5508920Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5509971Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5511022Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5511940Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.5512460Z Autotune Choices Stats: 2025-12-04T09:41:43.5513509Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5514541Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.5514800Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5515064Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5515739Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5516811Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5517913Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5518960Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5520046Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5521156Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5522210Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5523314Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5524364Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.5525416Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5526372Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.5526901Z Autotune Choices Stats: 2025-12-04T09:41:43.5527935Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5528967Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5529225Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5529495Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5530177Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5531229Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5532284Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5533391Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5534461Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.5535528Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5536603Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5537723Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5538792Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5539829Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5540733Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.5541334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5541712Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5542063Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5542547Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5543896Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5545026Z graph_break [] 2025-12-04T09:41:43.5545257Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.5545630Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5546003Z Autotune Choices Stats: 2025-12-04T09:41:43.5547004Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5548083Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5548345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5548610Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5549292Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5550351Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5551402Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5552463Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5553535Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5554650Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5555698Z triton_mm_1567 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5556751Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5557856Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5558901Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5559850Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.5560369Z Autotune Choices Stats: 2025-12-04T09:41:43.5561352Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.5562375Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5562635Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5562972Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5563647Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5564757Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5565810Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5566844Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5568054Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5569111Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5570167Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5571214Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5572269Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5573326Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5574239Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.5574761Z Autotune Choices Stats: 2025-12-04T09:41:43.5575794Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5576835Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5577095Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5577364Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5578047Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5579134Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5580200Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5581262Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5582300Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5583378Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5584445Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5585542Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.5586594Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5587695Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5588656Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.5589269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5589646Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5589956Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5590438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5592177Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5593727Z graph_break [] 2025-12-04T09:41:43.5593960Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5594336Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5594708Z Autotune Choices Stats: 2025-12-04T09:41:43.5595701Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.5596780Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5597036Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5597298Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5598028Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5599084Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5600197Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5601513Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5602559Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5603594Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5604703Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5605752Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5606848Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5607892Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5608800Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.5609323Z Autotune Choices Stats: 2025-12-04T09:41:43.5610317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5611402Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5611655Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5611913Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5612588Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5613642Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5614679Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5615713Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5616749Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5617889Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5618921Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5619961Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5621008Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5622059Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5622961Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.5623567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5623941Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5624246Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5624722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5626505Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5628139Z graph_break [] 2025-12-04T09:41:43.5628369Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5628745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5629116Z Autotune Choices Stats: 2025-12-04T09:41:43.5630106Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.5631187Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5631442Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5631700Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5632383Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5633453Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5634512Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5635549Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5636587Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5637635Z 
triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5638757Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5639851Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5640902Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5641953Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5642862Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.5643384Z Autotune Choices Stats: 2025-12-04T09:41:43.5644379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5645401Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5645662Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5645929Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5646656Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5647794Z triton_mm_1728 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5648837Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5649884Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5650931Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5652016Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5653053Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5654100Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5655154Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5656207Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5657146Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.5657777Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5658159Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5658466Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5658987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5660730Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5662280Z graph_break [] 2025-12-04T09:41:43.5662516Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5662889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5663262Z Autotune Choices Stats: 2025-12-04T09:41:43.5664263Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5665292Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5665547Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5665816Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5666499Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5667609Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5668707Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5669772Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5670838Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5671905Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5672995Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5674041Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5675086Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5676125Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5677019Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.5677536Z Autotune Choices Stats: 2025-12-04T09:41:43.5678590Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5679672Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5679941Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5680244Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5680927Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5681977Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5683025Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5684064Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5685106Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5686145Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5687187Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5688321Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5689427Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5690487Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5691400Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.5692002Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5692373Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5692680Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5693205Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5694944Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5696500Z graph_break [] 2025-12-04T09:41:43.5696728Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5697103Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5697476Z Autotune Choices Stats: 2025-12-04T09:41:43.5698468Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5699494Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5699754Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5700018Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5700897Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5701969Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5703037Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5704109Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5705176Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5706253Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5707372Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5708420Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.5709532Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5710633Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5711533Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.5712052Z Autotune Choices Stats: 2025-12-04T09:41:43.5713050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5714165Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5714424Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5714692Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5715375Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5716442Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5717547Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5718600Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:43.5719699Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5720739Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5721820Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5722868Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5723929Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5725005Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5725927Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.5726528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5726902Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5727265Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5727744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5729053Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), 
('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5730188Z graph_break [] 2025-12-04T09:41:43.5730468Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5730845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5731260Z Autotune Choices Stats: 2025-12-04T09:41:43.5732266Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.5733307Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5733567Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5733832Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5734521Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5735639Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5736704Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5737814Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5738866Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5739942Z triton_mm_1839 
0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5741002Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5742045Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5743138Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5744191Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5745100Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.5745619Z Autotune Choices Stats: 2025-12-04T09:41:43.5746605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5747660Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5747946Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5748213Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5748890Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5749948Z triton_mm_1856 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5751058Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5752093Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5753188Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5754238Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5755276Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5756369Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5757435Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5758530Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5759451Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.5760064Z Autotune 
Choices Stats: 2025-12-04T09:41:43.5761059Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5762086Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5762348Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5762627Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5763312Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5764415Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5765485Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5766543Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5767658Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5768742Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5769818Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:43.5770895Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5772015Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5773077Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5774017Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.5774625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5775006Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5775312Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5775791Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5777580Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5779187Z graph_break [] 2025-12-04T09:41:43.5779421Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5779800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5780178Z Autotune Choices Stats: 2025-12-04T09:41:43.5781183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.5782224Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5782487Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5782761Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5783446Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5784576Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5785625Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5786672Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5787738Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5788796Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5789282Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5789761Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5790236Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5790766Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5791114Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.5791274Z Autotune Choices Stats: 2025-12-04T09:41:43.5792137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5792232Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5792321Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5792433Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5792910Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5793433Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5793912Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5794386Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5794860Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5799563Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5800074Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5800924Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5801405Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5801959Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5802349Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.5802559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5802656Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5802805Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5803088Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5804812Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5804901Z graph_break [] 2025-12-04T09:41:43.5805014Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5805275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5805370Z Autotune Choices Stats: 2025-12-04T09:41:43.5806211Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5806391Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5806480Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5806590Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5807107Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5807599Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5808132Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5808609Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.5809090Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5809558Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5810026Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5810499Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5811012Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5811483Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5811817Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.5811911Z Autotune Choices Stats: 2025-12-04T09:41:43.5812751Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5812851Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5812944Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5813051Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5813535Z 
triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5814004Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5814472Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5814989Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5815493Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5815964Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5816430Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5816906Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5817421Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5817896Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5818235Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.5818410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5818509Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5818640Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5818887Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5820260Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5820348Z graph_break [] 2025-12-04T09:41:43.5820495Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5820672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5820766Z Autotune Choices Stats: 2025-12-04T09:41:43.5821601Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5821701Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5821791Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5821897Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5822377Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5822864Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5823340Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5823816Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5824338Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5824850Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5825322Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5825795Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5826272Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5826781Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5827134Z 
SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.5827241Z Autotune Choices Stats: 2025-12-04T09:41:43.5828103Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5828200Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5828288Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5828396Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5828880Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5829356Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5829868Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5830340Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5830809Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5831280Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5831753Z triton_mm_2031 0.0328 ms 84.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5832229Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5832703Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5833180Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5833582Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.5833799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5833895Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5834029Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5834285Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5835665Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5835792Z graph_break [] 2025-12-04T09:41:43.5835900Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5836076Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.5836176Z Autotune Choices Stats: 2025-12-04T09:41:43.5837011Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5837108Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5837196Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5837302Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5837834Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5838309Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5838783Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5839291Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5839815Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5840296Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5840778Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5841261Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5841739Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5842223Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5842558Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.5842656Z Autotune Choices Stats: 2025-12-04T09:41:43.5843548Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5843757Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5843847Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5843959Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5844440Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5844919Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5845448Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.5845920Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5846393Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5846862Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5847339Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5847820Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5848300Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5848814Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5849149Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.5849324Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5849418Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5849554Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5849809Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 
1)] 2025-12-04T09:41:43.5850763Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5850854Z graph_break [] 2025-12-04T09:41:43.5850961Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5851139Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5851232Z Autotune Choices Stats: 2025-12-04T09:41:43.5852060Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5852159Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5852288Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5852397Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5852915Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5853397Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5853878Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5854354Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5854882Z 
triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5855372Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5855851Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5856326Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5856801Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5857306Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5857665Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.5857761Z Autotune Choices Stats: 2025-12-04T09:41:43.5858635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5858731Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5858821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5858929Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5859413Z triton_mm_2116 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5859884Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5860359Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5860839Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5861310Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5861823Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5862330Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5862813Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5863289Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5863766Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5864144Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.5864237Z Autotune Choices Stats: 2025-12-04T09:41:43.5865080Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5865174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5865260Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5865376Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5865848Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5868770Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5869248Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5869779Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5870263Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5870740Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:43.5871246Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5871726Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5872216Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5872694Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5873032Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.5873209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5873357Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5873492Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5873741Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5875163Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5875247Z graph_break [] 2025-12-04T09:41:43.5875355Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5875532Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5875624Z Autotune Choices Stats: 2025-12-04T09:41:43.5876462Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5876561Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5876650Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5876761Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5877236Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5877715Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5878270Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5878747Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5879263Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5879812Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5880286Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5880759Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5881239Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5881710Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5882046Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.5882139Z Autotune Choices Stats: 2025-12-04T09:41:43.5883025Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5883123Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5883209Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5883357Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5883838Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5884315Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5884790Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5885263Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5885737Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5886207Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5886677Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5887160Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5887733Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5888209Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5888538Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.5888755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5888856Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5888990Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5889242Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5890188Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5890278Z graph_break [] 2025-12-04T09:41:43.5890387Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5890561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5890657Z Autotune Choices Stats: 2025-12-04T09:41:43.5891482Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5891581Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5891669Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5891775Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5892297Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5892777Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5893300Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5893774Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5894251Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5894737Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5895224Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5895697Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5896177Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5896649Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5897028Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.5897121Z Autotune Choices Stats: 2025-12-04T09:41:43.5898045Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5898141Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5898231Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5898336Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.5898809Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5899286Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5899759Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5900239Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5900966Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5901438Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5901981Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5902458Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5903025Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5903496Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5903827Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.5903920Z Autotune Choices Stats: 2025-12-04T09:41:43.5904756Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.5904857Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5904945Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5905059Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5905539Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5906006Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5906484Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5907025Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5907502Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5908025Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5908498Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5908973Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5909447Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5909925Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5910256Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.5910435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5910532Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5910664Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5910913Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5912335Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5912466Z graph_break [] 2025-12-04T09:41:43.5912572Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5912747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5912842Z Autotune Choices Stats: 2025-12-04T09:41:43.5913674Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5913774Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5913862Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5913972Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5914451Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5914937Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5915413Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5915886Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5916400Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5916918Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.5917438Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5917906Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5918382Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5918856Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5919191Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.5919290Z Autotune Choices Stats: 2025-12-04T09:41:43.5920175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5920270Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5920358Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5920466Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5920985Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5921497Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5921969Z 
triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5922442Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5922907Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5923381Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5923851Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5924330Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5924806Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5925279Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5925661Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.5925838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5925932Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5926069Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5926353Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5927786Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5927876Z graph_break [] 2025-12-04T09:41:43.5927981Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5928158Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5928252Z Autotune Choices Stats: 2025-12-04T09:41:43.5929091Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5929187Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5929275Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5929384Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5929861Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5930387Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5930905Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5931377Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5931851Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5932319Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5932801Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5933271Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5933750Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5934228Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5934557Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.5934700Z Autotune Choices Stats: 2025-12-04T09:41:43.5935549Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5935649Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5935775Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5935882Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5936365Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5936848Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5937360Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5937831Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5938305Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5938773Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5939239Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5939782Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5940292Z triton_mm_2381 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5940769Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5941099Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.5941273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5941371Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5941513Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5941766Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5943146Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5943238Z graph_break [] 2025-12-04T09:41:43.5943344Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5943519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5943619Z Autotune Choices Stats: 2025-12-04T09:41:43.5944459Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.5944609Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.5944698Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5944804Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5945322Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5945793Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5946267Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5946744Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5947221Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5947736Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5948228Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.5948713Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5949233Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5949756Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5950092Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.5950185Z Autotune Choices Stats: 2025-12-04T09:41:43.5951026Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5951121Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5951213Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5951323Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5951802Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5952281Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5952749Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5953224Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5953742Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.5954222Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5954731Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5955216Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5955696Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5956172Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5956516Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.5956691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5956791Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5956927Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5957174Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5958173Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5958259Z graph_break [] 2025-12-04T09:41:43.5958405Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5958586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5958721Z Autotune Choices Stats: 2025-12-04T09:41:43.5959623Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5959721Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5959809Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5959916Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5960391Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5960871Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5961349Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5961830Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5962310Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5962786Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5963315Z 
triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5963786Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5964300Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5964773Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5965106Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:43.5965210Z Autotune Choices Stats: 2025-12-04T09:41:43.5966035Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5966140Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5966233Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5966342Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5966820Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5967293Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5967817Z triton_mm_2457 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5968330Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5968801Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5969273Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5969737Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5970224Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5970698Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5971176Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5971508Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.5971601Z Autotune Choices Stats: 2025-12-04T09:41:43.5972434Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.5972575Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5972667Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5972779Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5973323Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5973798Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5974273Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5974757Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5975225Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5975696Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5976173Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5976645Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.5977169Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5977645Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5978035Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.5978238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5978340Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5978475Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.5978722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.5979673Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.5979762Z graph_break [] 2025-12-04T09:41:43.5979870Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.5980049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.5980142Z Autotune Choices Stats: 2025-12-04T09:41:43.5980972Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.5981068Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5981157Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.5981270Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.5981801Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5982272Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5982781Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5983255Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5983738Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5984218Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5984691Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5985160Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5985628Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5986105Z triton_mm_2514 
0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5986482Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.5986582Z Autotune Choices Stats: 2025-12-04T09:41:43.5987452Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.5987549Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5987646Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5987769Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.5988273Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.5988754Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5989230Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5989712Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5990185Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5990655Z triton_mm_2548 0.0307 ms 93.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5991171Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5991652Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5992236Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5992710Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5993051Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.5993146Z Autotune Choices Stats: 2025-12-04T09:41:43.5993979Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.5994077Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.5994164Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.5994281Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.5994762Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5995246Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.5995759Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.5996247Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5996775Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5997250Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5997730Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.5998214Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.5998690Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5999164Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.5999577Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.5999751Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.5999844Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.5999980Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6000596Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6002047Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6002141Z graph_break [] 2025-12-04T09:41:43.6002244Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6002420Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6002510Z Autotune Choices Stats: 2025-12-04T09:41:43.6003344Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.6003442Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6003531Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6003635Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6004114Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6004584Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6005062Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6005592Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6006080Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6006620Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6007122Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6007605Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6008076Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6008549Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6008882Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.6008980Z Autotune Choices Stats: 2025-12-04T09:41:43.6009813Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6009909Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6010082Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6010186Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6010670Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6011140Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6011651Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6012127Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6012595Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6013064Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6013535Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6014010Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6014482Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6014990Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6015325Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.6015539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6015637Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6015770Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6016015Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6016959Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6017044Z graph_break [] 2025-12-04T09:41:43.6017151Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6017326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6017417Z Autotune Choices Stats: 2025-12-04T09:41:43.6018255Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.6018346Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6018435Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6018539Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6019016Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6019506Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6020032Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6020546Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6021016Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6021481Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6021958Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6022425Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6022903Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6023375Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6023710Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.6023802Z Autotune Choices Stats: 2025-12-04T09:41:43.6024670Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.6025163Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6025252Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6025361Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6025842Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6026312Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6026787Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6027267Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6027744Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6028214Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6028686Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6029163Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6029692Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6030207Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6030538Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.6030638Z Autotune Choices Stats: 2025-12-04T09:41:43.6031472Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6031571Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6031660Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6031770Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6032253Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.6032729Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6033199Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6033673Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6034190Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6034708Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6035175Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6035649Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6036114Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6036582Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6036924Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds 
and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.6037127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6037246Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6037378Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6037624Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6039011Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6039138Z graph_break [] 2025-12-04T09:41:43.6039245Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6039419Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6039602Z Autotune Choices Stats: 2025-12-04T09:41:43.6040472Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6040563Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6040651Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6040759Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6041243Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6041720Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6042193Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6042674Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6043146Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6043669Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6044209Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6044689Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6045166Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6045638Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6045983Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.6046077Z Autotune Choices Stats: 
2025-12-04T09:41:43.6046932Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6047048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6047148Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6047264Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6047741Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6048261Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6048739Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6049248Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6049720Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6050186Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6050661Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.6051138Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6051612Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6052095Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6052426Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.6052694Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.6052799Z Traceback (most recent call last): 2025-12-04T09:41:43.6053215Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.6053444Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.6053791Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.6053974Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.6054135Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.6054217Z Searched string: 2025-12-04T09:41:43.6054357Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.6054363Z 2025-12-04T09:41:43.6054479Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.6054487Z 2025-12-04T09:41:43.6054614Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.6054748Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.6054753Z 2025-12-04T09:41:43.6054848Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:43.6054946Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.6055037Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6055129Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.6055133Z 2025-12-04T09:41:43.6055226Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.6055316Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.6055406Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6055498Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.6055502Z 2025-12-04T09:41:43.6055506Z 2025-12-04T09:41:43.6055664Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.6055668Z 2025-12-04T09:41:43.6055672Z 2025-12-04T09:41:43.6055796Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.6055956Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.6056071Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.6056163Z idx_m = rm[:, None] 2025-12-04T09:41:43.6056250Z idx_n = rn[None, :] 2025-12-04T09:41:43.6056347Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.6056358Z 2025-12-04T09:41:43.6056456Z # inductor generates a suffix 2025-12-04T09:41:43.6056548Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6056803Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.6056897Z ''', device_str='cuda') 2025-12-04T09:41:43.6056902Z 2025-12-04T09:41:43.6056905Z 2025-12-04T09:41:43.6057011Z async_compile.wait(globals()) 2025-12-04T09:41:43.6057098Z del async_compile 2025-12-04T09:41:43.6057103Z 2025-12-04T09:41:43.6057183Z class Runner: 2025-12-04T09:41:43.6057284Z def __init__(self, partitions): 2025-12-04T09:41:43.6057394Z self.partitions = partitions 2025-12-04T09:41:43.6057401Z 2025-12-04T09:41:43.6057514Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.6057611Z new_callables = [] 2025-12-04T09:41:43.6057735Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.6057843Z new_callables.append(fn(c)) 2025-12-04T09:41:43.6057954Z self.partitions = new_callables 
2025-12-04T09:41:43.6057958Z 2025-12-04T09:41:43.6058050Z def call(self, args): 2025-12-04T09:41:43.6058140Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.6058233Z args.clear() 2025-12-04T09:41:43.6058369Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.6058506Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.6058616Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.6058714Z torch.cuda.set_device(0) 2025-12-04T09:41:43.6058898Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.6059152Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.6059297Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.6059513Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.6059638Z del arg0_1 2025-12-04T09:41:43.6059814Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.6060108Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.6060215Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.6060463Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.6060547Z del arg1_1 2025-12-04T09:41:43.6060626Z del buf0 2025-12-04T09:41:43.6060715Z return (buf1, ) 2025-12-04T09:41:43.6060719Z 2025-12-04T09:41:43.6060821Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.6060904Z call = runner.call 2025-12-04T09:41:43.6061067Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.6061072Z 2025-12-04T09:41:43.6061078Z 2025-12-04T09:41:43.6061216Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.6061353Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.6061500Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.6061702Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.6061908Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.6062006Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.6062166Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.6062174Z 2025-12-04T09:41:43.6062178Z 2025-12-04T09:41:43.6062266Z if __name__ == "__main__": 2025-12-04T09:41:43.6062463Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.6062668Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.6062750Z From CHECK: .to( 2025-12-04T09:41:43.6062756Z 2025-12-04T09:41:43.6062760Z 2025-12-04T09:41:43.6062935Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.6063497Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.6063502Z 2025-12-04T09:41:43.6063760Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.6067852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6067967Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6068104Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6068357Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6069735Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6069827Z graph_break [] 2025-12-04T09:41:43.6069935Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6070118Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6070213Z Autotune Choices Stats: 2025-12-04T09:41:43.6071061Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6071163Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6071252Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6071426Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6071920Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6072429Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6072898Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6073355Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6073815Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6074294Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6074756Z 
triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6075216Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6075672Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6076185Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6076520Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.6076615Z Autotune Choices Stats: 2025-12-04T09:41:43.6077490Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6077606Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6077711Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6077843Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6078319Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6078783Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6079253Z triton_mm_45 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6079778Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6080236Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6080766Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6081227Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6081734Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6082202Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6082666Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6083004Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.6083183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6083278Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6083417Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.6083665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6084611Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6084695Z graph_break [] 2025-12-04T09:41:43.6084799Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6084978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6085119Z Autotune Choices Stats: 2025-12-04T09:41:43.6085947Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6086044Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6086131Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6086279Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6086755Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6087224Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6087704Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6088178Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6088664Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6089144Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6089615Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6090125Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6090594Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6091098Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6091428Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.6091524Z Autotune Choices Stats: 2025-12-04T09:41:43.6092346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6092450Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6092539Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.6092652Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6093123Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6093585Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6094053Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6094515Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6095021Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6095488Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6095985Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6096457Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6096931Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6097403Z triton_mm_833 0.0338 ms 
81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6097739Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.6097835Z Autotune Choices Stats: 2025-12-04T09:41:43.6098721Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.6098815Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6098909Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6099025Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6099552Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6100054Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6100688Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6101163Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6101635Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6102108Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6102578Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6103050Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6103514Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6103973Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6104383Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.6104561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6104656Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6104792Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6105094Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6106039Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6106126Z graph_break [] 2025-12-04T09:41:43.6106235Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6106414Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:43.6106509Z Autotune Choices Stats: 2025-12-04T09:41:43.6107356Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.6107460Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6107550Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6107675Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6108185Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6108649Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6109174Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6109688Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6110165Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6110633Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6111109Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6111582Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6112050Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6112515Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6112841Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.6112937Z Autotune Choices Stats: 2025-12-04T09:41:43.6113760Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.6113898Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6113989Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6114095Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6114631Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6115098Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6115563Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6116034Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6116496Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6116963Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6117424Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6117896Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6118408Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6118913Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6119250Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.6119344Z Autotune Choices Stats: 2025-12-04T09:41:43.6120230Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6120328Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6120415Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6120533Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6121008Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6121491Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6121968Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6122429Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6122943Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6123408Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6123916Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6124378Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6124842Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6125314Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6125644Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.6125822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6125922Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6126062Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6126310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6127774Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6127871Z graph_break [] 2025-12-04T09:41:43.6128019Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6128198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6128288Z Autotune Choices Stats: 2025-12-04T09:41:43.6129129Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6129228Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.6129314Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6129423Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6129910Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6130376Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6130848Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6131308Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6131776Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6132303Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6132787Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6133295Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6133767Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.6134234Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6134573Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.6134668Z Autotune Choices Stats: 2025-12-04T09:41:43.6135507Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6135600Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6135693Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6135797Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6136279Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6136786Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6137267Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6137810Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6138290Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6138753Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6139227Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6139705Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6140184Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6140656Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6140988Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.6141163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6141381Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6141516Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6141764Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6142749Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6142833Z graph_break [] 2025-12-04T09:41:43.6142942Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6143116Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6143209Z Autotune Choices Stats: 2025-12-04T09:41:43.6144050Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.6144147Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6144238Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6144347Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6144829Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6145307Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6145776Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6146291Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6146796Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6147268Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6147741Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6148216Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6148697Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6149173Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6149509Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.6149604Z Autotune Choices Stats: 2025-12-04T09:41:43.6150450Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6150617Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6150705Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6150813Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6151291Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6151808Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6152296Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6152766Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6153242Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6153709Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6154185Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6154660Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6155133Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6155651Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6155984Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.6156120Z Autotune Choices Stats: 2025-12-04T09:41:43.6156959Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6157054Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6157144Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6157256Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6157769Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6158274Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6158743Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6159217Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6159746Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6160220Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6160731Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6161248Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6161724Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6162197Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6162535Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.6162712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6162810Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6162945Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6163191Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6164140Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6164223Z graph_break [] 2025-12-04T09:41:43.6164328Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6164505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6164601Z Autotune Choices Stats: 2025-12-04T09:41:43.6165473Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6165604Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6165692Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6165804Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.6166281Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6166756Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6167246Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6167770Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6168260Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6168738Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6169211Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6169732Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6170206Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6170715Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6171049Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.6171146Z Autotune Choices Stats: 2025-12-04T09:41:43.6171981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6172079Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6172166Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6172274Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6172747Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6173224Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6173700Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6174212Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6174694Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6175201Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6175669Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6176146Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6176625Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6177100Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6177436Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.6177528Z Autotune Choices Stats: 2025-12-04T09:41:43.6178357Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6178451Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6178595Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6178707Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6179185Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6179663Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6180178Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6180661Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6181139Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6181626Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6182116Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6182585Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6183060Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6183565Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6183903Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.6184141Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.6184235Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6184368Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6184615Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6186021Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6186108Z graph_break [] 2025-12-04T09:41:43.6186213Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6186393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6186484Z Autotune Choices Stats: 2025-12-04T09:41:43.6187358Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6187464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6187552Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6187660Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6188133Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6188656Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.6189131Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6189649Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6190131Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6190608Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6191093Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6191582Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6192063Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6192536Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6192870Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.6193005Z Autotune Choices Stats: 2025-12-04T09:41:43.6193839Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6193974Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6194062Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6194165Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6194650Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6195124Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6195601Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6196071Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6196541Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6197009Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6197526Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6198046Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6198521Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6199035Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6199369Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.6199607Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6199706Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6199840Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6200095Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6201626Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6201714Z graph_break [] 2025-12-04T09:41:43.6201819Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6201993Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6202086Z Autotune Choices Stats: 2025-12-04T09:41:43.6202996Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6203150Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6203240Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6203345Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6203836Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6204308Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6204775Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6205259Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6205731Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6206208Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6206685Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6207157Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.6207702Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6208186Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6208576Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.6208677Z Autotune Choices Stats: 2025-12-04T09:41:43.6209518Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6209618Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6209709Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6209819Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6210302Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6210788Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6211262Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6211735Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6212249Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6212755Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6213232Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6213709Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6214188Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6214670Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6215005Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.6215184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6215281Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6215416Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6215665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6217058Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6217207Z graph_break [] 2025-12-04T09:41:43.6217332Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6217509Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6217604Z Autotune Choices Stats: 2025-12-04T09:41:43.6218472Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.6218572Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6218663Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6218778Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6219258Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6219736Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6220214Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6220686Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6221162Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6221707Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6222232Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6222706Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6223175Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6223648Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6223987Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.6224090Z Autotune Choices Stats: 2025-12-04T09:41:43.6224928Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6225025Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6225117Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6225223Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6225713Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6226235Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6226710Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6227233Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6227747Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6228224Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6228697Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6229182Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6229658Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6230133Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6230472Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.6230651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6230789Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6230926Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6231212Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6232597Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6232683Z graph_break [] 2025-12-04T09:41:43.6232794Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6232969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6233063Z Autotune Choices Stats: 2025-12-04T09:41:43.6233902Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6234000Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6234092Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6234200Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6234678Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6235155Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6235672Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6236161Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6236682Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6237163Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6237651Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6238134Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6238616Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6239087Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6239426Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.6239565Z 
Autotune Choices Stats: 2025-12-04T09:41:43.6240431Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.6240536Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6240661Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6240768Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6241244Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6241709Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6242180Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6242656Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6243124Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6243591Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6244059Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.6244531Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6245048Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6245527Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6245897Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.6246074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6246169Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6246300Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6246552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6247980Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6248072Z graph_break [] 2025-12-04T09:41:43.6248175Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6248350Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6248446Z Autotune Choices Stats: 2025-12-04T09:41:43.6249289Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6249387Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6249472Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6249615Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6250101Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6250621Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6251095Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6251564Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6252034Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6252503Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6252980Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6253456Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6253926Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6254448Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6254778Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.6254868Z Autotune Choices Stats: 2025-12-04T09:41:43.6255779Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6255874Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6255965Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6256068Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6256547Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6257048Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6257557Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6258029Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6258498Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6259004Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6259477Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6259994Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6260473Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6260945Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6261281Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.6261457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6261553Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6261688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6261931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6262878Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6262960Z graph_break [] 2025-12-04T09:41:43.6263063Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6263240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6263378Z Autotune Choices Stats: 2025-12-04T09:41:43.6264227Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.6264323Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6264408Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6264559Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6265035Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6265504Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6265986Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6266453Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6266935Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6267406Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6267884Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6268403Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6268922Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6269399Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6269735Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.6269829Z Autotune Choices Stats: 2025-12-04T09:41:43.6270658Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6270754Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6270839Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6270945Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6271423Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.6271894Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6272363Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6272845Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6273362Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6273880Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6274350Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6274824Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6275302Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6275775Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6276104Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.6276198Z Autotune Choices Stats: 2025-12-04T09:41:43.6277038Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6277129Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6277223Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6277334Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6277907Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6278435Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6278912Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6279397Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6279927Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6280408Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6280905Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6281396Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6281883Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6282364Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6282741Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.6282916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6283008Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6283181Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6283428Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6284368Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6284457Z graph_break [] 2025-12-04T09:41:43.6284560Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6284738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6284831Z Autotune Choices Stats: 2025-12-04T09:41:43.6285658Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6285755Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6285842Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6285948Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6286423Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6286947Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6287443Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6288062Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6288537Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6289007Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6289483Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6289951Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6290420Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6290889Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6291217Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.6291312Z Autotune Choices Stats: 2025-12-04T09:41:43.6292210Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6292305Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6292395Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6292497Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6293018Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6293491Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6293960Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6294434Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.6294905Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6295372Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6295837Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6296355Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6296832Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6297348Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6297683Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.6297776Z Autotune Choices Stats: 2025-12-04T09:41:43.6298609Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6298706Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6298792Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6298910Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:43.6299385Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6299866Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6300500Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6300983Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6301536Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6302018Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6302562Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6303036Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6303510Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6303978Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6304311Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.6304491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6304586Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6304719Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6304965Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6305961Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6306050Z graph_break [] 2025-12-04T09:41:43.6306153Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6306393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6306482Z Autotune Choices Stats: 2025-12-04T09:41:43.6307330Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6307443Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6307539Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6307655Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6308147Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6308624Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6309105Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6309585Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6310072Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6310593Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6311060Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6311588Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6312056Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6312527Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6312866Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:43.6312963Z Autotune Choices Stats: 2025-12-04T09:41:43.6313790Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.6313889Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6313979Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6314082Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6314561Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6315043Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6315554Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6316063Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6316534Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6317005Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6317476Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6318012Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6318489Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6318963Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6319301Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.6319396Z Autotune Choices Stats: 2025-12-04T09:41:43.6320440Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6320581Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6320669Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6320787Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6321394Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6321955Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6322506Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.6323066Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6323622Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6324178Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6324738Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6325295Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6325896Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6326524Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6326909Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.6327106Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6327204Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6327350Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6327629Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.6329350Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6329438Z graph_break [] 2025-12-04T09:41:43.6329548Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6329745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6329837Z Autotune Choices Stats: 2025-12-04T09:41:43.6330830Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6330983Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6331071Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6331179Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6331749Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6332344Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6332915Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6333473Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6334032Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6334583Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6335140Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6335694Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6336251Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6336850Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6337325Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.6337424Z Autotune Choices Stats: 2025-12-04T09:41:43.6338426Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6338521Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6338611Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6338719Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6339293Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6339846Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6340398Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6340952Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6341501Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6342099Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6342649Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6343248Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6343806Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.6344362Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6344752Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.6344946Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6345052Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6345193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6345475Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6347194Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6347279Z graph_break [] 2025-12-04T09:41:43.6347431Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6347629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6347761Z Autotune Choices Stats: 2025-12-04T09:41:43.6348823Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.6348921Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6349012Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6353944Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.6354462Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6354953Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6355429Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6355901Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6356366Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6356841Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6357381Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6357856Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6358371Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6358848Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6359183Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.6359280Z Autotune Choices Stats: 2025-12-04T09:41:43.6360221Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6360327Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6360416Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6360524Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6361006Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6361478Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6361995Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6362472Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6363013Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6363481Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6363949Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6364429Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6364906Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6365385Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6365714Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.6365894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6365988Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6366121Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6366416Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6367898Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6367994Z graph_break [] 2025-12-04T09:41:43.6368101Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.6368276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6368371Z Autotune Choices Stats: 2025-12-04T09:41:43.6369218Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6369319Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6369410Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6369517Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6370008Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6370495Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6370981Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6371503Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6371988Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6372508Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6372982Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6373453Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6373926Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6374401Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6374738Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.6374830Z Autotune Choices Stats: 2025-12-04T09:41:43.6375675Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6375813Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6375903Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6376013Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6376492Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6377010Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6377479Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6377973Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6378476Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6378943Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6379417Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6379893Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6380370Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6380885Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6381224Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.6381437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6381533Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6381670Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.6381916Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6383300Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6383385Z graph_break [] 2025-12-04T09:41:43.6383489Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6383669Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6383763Z Autotune Choices Stats: 2025-12-04T09:41:43.6384595Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6384694Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6384783Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6384892Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6385372Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6385898Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6386419Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.6386899Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6387380Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6387863Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6388337Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6388817Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6389286Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6389754Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6390130Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.6390229Z Autotune Choices Stats: 2025-12-04T09:41:43.6391071Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:43.6391210Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6391298Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6391402Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6391883Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6392360Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6392844Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6393321Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6393794Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6394262Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6394729Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6395259Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6395777Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6396254Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6396583Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.6396756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6396857Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6396998Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6397292Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6398239Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6398324Z graph_break [] 2025-12-04T09:41:43.6398430Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6398603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6398695Z Autotune Choices Stats: 2025-12-04T09:41:43.6399655Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.6399754Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6399882Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6399988Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6400643Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6401121Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6401593Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6402079Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6402547Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6403030Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6403505Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6403977Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6404578Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6405055Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6405450Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.6405545Z Autotune Choices Stats: 2025-12-04T09:41:43.6406390Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6406487Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6406578Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6406684Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6407162Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6407690Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6408168Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6408636Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6409162Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6409636Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.6410164Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6410637Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6411108Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6411594Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6411924Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.6412026Z Autotune Choices Stats: 2025-12-04T09:41:43.6412855Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6412948Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6413037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6413149Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6413624Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6414142Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.6414617Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6415131Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6415614Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6416095Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6416576Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6417067Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6417541Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6418012Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6418345Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.6418558Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6418659Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.6418848Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6419096Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6420483Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6420566Z graph_break [] 2025-12-04T09:41:43.6420674Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6420854Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6420948Z Autotune Choices Stats: 2025-12-04T09:41:43.6421788Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.6421888Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6421980Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6422085Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6422566Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6423044Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6423562Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6424037Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6424550Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6425029Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6425511Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6425982Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6426470Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6426941Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6427278Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.6427370Z Autotune Choices Stats: 2025-12-04T09:41:43.6428292Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6428430Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6428518Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6428630Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6429110Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6429583Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6430062Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6430535Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6431008Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6431477Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6431950Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6432423Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6432939Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6433415Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6433809Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.6433987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6434083Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6434213Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6434463Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6435840Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6435930Z graph_break [] 2025-12-04T09:41:43.6436037Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6436210Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6436313Z Autotune Choices Stats: 2025-12-04T09:41:43.6437152Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.6437295Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6437383Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6437488Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6438061Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6438543Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6439019Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6439541Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6440022Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6440496Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6440964Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6441432Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6441899Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6442422Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6442757Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.6442848Z Autotune Choices Stats: 2025-12-04T09:41:43.6443726Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6443823Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6443914Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6444018Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6444501Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6444976Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6445444Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6445914Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6446381Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6446901Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6447445Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6447922Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6448398Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6448868Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6449206Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.6449378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6449474Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6449607Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6449855Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6451233Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6451360Z graph_break [] 2025-12-04T09:41:43.6451467Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6451642Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6451736Z Autotune Choices Stats: 2025-12-04T09:41:43.6452609Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6452707Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6452795Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6452901Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6453376Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6453864Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6454343Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6454826Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6455309Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6455778Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6456291Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6456801Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6457280Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6457749Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6458107Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.6458223Z Autotune Choices Stats: 2025-12-04T09:41:43.6459086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6459185Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6459272Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6459377Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6459858Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6460330Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6460845Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6461309Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6461820Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6462293Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6462758Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6463237Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6463711Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6464186Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6464516Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.6464690Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.6464788Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6464919Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6465213Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6466594Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6466718Z graph_break [] 2025-12-04T09:41:43.6466824Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6466998Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6467095Z Autotune Choices Stats: 2025-12-04T09:41:43.6467980Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6468078Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6468169Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6468275Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6468757Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6469229Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.6469695Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6470231Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6470702Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6471221Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6471693Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6472168Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6472646Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6473129Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6473468Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.6473561Z Autotune Choices Stats: 2025-12-04T09:41:43.6474397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6474495Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6474581Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6474728Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6475208Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6475726Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6476199Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6476664Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6477186Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6477650Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6478132Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6478605Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6479080Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6479652Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6479983Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.6480161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6480255Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6480431Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6480678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6481618Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6481707Z graph_break [] 2025-12-04T09:41:43.6481814Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6481988Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6482085Z Autotune Choices Stats: 2025-12-04T09:41:43.6482915Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6483013Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6483100Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6483203Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6483682Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6484197Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6484674Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6485188Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6485669Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6486151Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6486627Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6487097Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6487570Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6488045Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6488423Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.6488587Z Autotune Choices Stats: 2025-12-04T09:41:43.6489434Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6489531Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6489622Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6489770Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6490244Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6490713Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6491184Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6491659Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6492131Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.6492598Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6493063Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6493578Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6494054Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6494571Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6494904Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.6494998Z Autotune Choices Stats: 2025-12-04T09:41:43.6495829Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6495927Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6496016Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6496135Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6496611Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.6497087Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6497566Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6498091Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6498616Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6499137Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6499622Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6500102Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6500743Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6501222Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6501557Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:43.6501737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6501832Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6501967Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6502217Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6503662Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6503839Z graph_break [] 2025-12-04T09:41:43.6503944Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6504123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6504220Z Autotune Choices Stats: 2025-12-04T09:41:43.6505049Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6505146Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6505241Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6505353Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6505832Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6506315Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6506788Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6507311Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6507790Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6508320Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6508850Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6509320Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6509795Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6510269Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6510602Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.6510698Z Autotune Choices Stats: 
2025-12-04T09:41:43.6511541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6511636Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6511726Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6511833Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6512313Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6512836Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6513349Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6513825Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6514292Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6514760Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6515231Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.6515711Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6516188Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6516661Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6516996Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.6517217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6517314Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6517453Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6517711Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6518730Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6518816Z graph_break [] 2025-12-04T09:41:43.6518920Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6519101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6519198Z Autotune Choices Stats: 2025-12-04T09:41:43.6520100Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6520197Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6520286Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6520398Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6520872Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6521354Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6521868Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6522345Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6522869Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6523346Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6523835Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6524307Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6524789Z triton_mm_2225 
0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6525259Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6525592Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.6525689Z Autotune Choices Stats: 2025-12-04T09:41:43.6526534Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6526672Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6526762Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6526868Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6527349Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6527914Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6528395Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6528870Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6529342Z triton_mm_2248 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6529815Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6530282Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6530758Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6531271Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6531758Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6532131Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.6532226Z Autotune Choices Stats: 2025-12-04T09:41:43.6533063Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.6533158Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6533250Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6533367Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6533853Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6534329Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6534802Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6535278Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6535747Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6536264Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6536732Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6537244Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6537722Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6538198Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.6538531Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.6538708Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6538802Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6538939Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6539190Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6540568Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6540719Z graph_break [] 2025-12-04T09:41:43.6540830Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6541009Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6541144Z Autotune Choices Stats: 2025-12-04T09:41:43.6541985Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6542078Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6542166Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6542277Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6542757Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.6543254Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6543732Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6544205Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6544684Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6545154Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6545668Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6546136Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6546652Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6547122Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6547458Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.6547582Z Autotune Choices Stats: 2025-12-04T09:41:43.6548460Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6548561Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6548654Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6548762Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6549247Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6549723Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6550239Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6550803Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6551274Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6551747Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6552213Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6552699Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6553175Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6553656Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6553989Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.6554165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6554269Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6554443Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6554697Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6556143Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6556236Z graph_break [] 2025-12-04T09:41:43.6556348Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6556524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6556622Z Autotune Choices Stats: 2025-12-04T09:41:43.6557465Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6557581Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6557691Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6557803Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6558292Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6558774Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6559258Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6559827Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6560343Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6560816Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6561289Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6561763Z 
triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6562243Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6562723Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6563062Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.6563157Z Autotune Choices Stats: 2025-12-04T09:41:43.6564010Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6564147Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6564233Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6564345Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6564825Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6565348Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6565818Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6566284Z triton_mm_2374 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6566763Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6567233Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6567705Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6568186Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6568665Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6569178Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6569549Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.6569727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6569826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6569965Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6570219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6571602Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6571697Z graph_break [] 2025-12-04T09:41:43.6571805Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6571984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6572077Z Autotune Choices Stats: 2025-12-04T09:41:43.6572912Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.6573012Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6573100Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6573211Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6573748Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6574220Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6574762Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6575240Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6575725Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6576202Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6576685Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6577172Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6577647Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6578164Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6578552Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.6578651Z Autotune Choices Stats: 2025-12-04T09:41:43.6579548Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6579644Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6579735Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6579840Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.6580329Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6580808Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6581278Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6581753Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6582221Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6582695Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6583215Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6583693Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6584210Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6584685Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6585020Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.6585195Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6585294Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6585429Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6585678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6586628Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6586712Z graph_break [] 2025-12-04T09:41:43.6586822Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6586998Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6587093Z Autotune Choices Stats: 2025-12-04T09:41:43.6588014Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6588151Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6588239Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6588350Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6588831Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6589313Z 
triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6589790Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6590271Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6590754Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6591230Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6591702Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6592169Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6592685Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6593153Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6593524Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 
2025-12-04T09:41:43.6593624Z Autotune Choices Stats: 2025-12-04T09:41:43.6594451Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6594553Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6594641Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6594749Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6595231Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6595716Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6596188Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6596664Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6597178Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6597648Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6598155Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6598635Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6599108Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6599648Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6599984Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.6600077Z Autotune Choices Stats: 2025-12-04T09:41:43.6601180Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6601278Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6601370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6601486Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6602046Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6602591Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6603061Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.6603595Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6604068Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6604545Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6605021Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6605498Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6605974Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6606451Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6606790Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.6607041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6607208Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6607356Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6607602Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.6608547Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6608631Z graph_break [] 2025-12-04T09:41:43.6608738Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6608918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6609016Z Autotune Choices Stats: 2025-12-04T09:41:43.6609862Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.6609959Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6610047Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6610158Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6610650Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6611123Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6611672Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6612151Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6612678Z 
triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6613153Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6613625Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6614098Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6614570Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6615043Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6615376Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.6615476Z Autotune Choices Stats: 2025-12-04T09:41:43.6616346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6616450Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6616538Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6616682Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6617162Z triton_mm_2543 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6617677Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6618174Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6618647Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6619123Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6619598Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6620065Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6620542Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6621058Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6621536Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6621866Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.6621998Z Autotune Choices Stats: 2025-12-04T09:41:43.6622829Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6622924Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6623017Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6623131Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6623608Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6624089Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6624567Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6625046Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6625570Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6626051Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.6626574Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6627050Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6627526Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6627997Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6628330Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.6628505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6628600Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6628736Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6628982Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6630362Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6630493Z graph_break [] 2025-12-04T09:41:43.6630600Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6630786Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6630877Z Autotune Choices Stats: 2025-12-04T09:41:43.6631746Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.6631844Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6631931Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6632041Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6632518Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6633002Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6633481Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6637379Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6637879Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6638415Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6638891Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6639401Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6639936Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6640416Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6640751Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.6640849Z Autotune Choices Stats: 2025-12-04T09:41:43.6641678Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6641781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6641867Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6641973Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6642462Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6642930Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6643460Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6643929Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6644434Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6644905Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6645371Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6645851Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6646326Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6646803Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6647157Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.6647356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6647453Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6647585Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6647875Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6648880Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6648964Z graph_break [] 2025-12-04T09:41:43.6649071Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6649243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6649334Z Autotune Choices Stats: 2025-12-04T09:41:43.6650168Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6650263Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6650352Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6650458Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6650935Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6651424Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6651905Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6652378Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6652888Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6653398Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6653873Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6654347Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6654818Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6655290Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6655625Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.6655719Z Autotune Choices Stats: 2025-12-04T09:41:43.6656546Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.6656638Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6656723Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6656830Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.6657338Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6657841Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6658314Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6658790Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6659256Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6659727Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6660197Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6660673Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6661145Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6661618Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6661985Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.6662083Z Autotune Choices Stats: 2025-12-04T09:41:43.6662952Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6663047Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6663135Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6663244Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6663720Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6664199Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6664671Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6665138Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6665609Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6666092Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6666600Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6667115Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6667582Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6668098Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6668426Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.6668603Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6668699Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6668827Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6669076Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6670449Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6670531Z graph_break [] 2025-12-04T09:41:43.6670636Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6670854Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6670944Z Autotune Choices Stats: 2025-12-04T09:41:43.6671797Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6671890Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6672018Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6672124Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6672608Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6673081Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6673551Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6674026Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6674501Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6674980Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.6675450Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6675965Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6676474Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6676942Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6677276Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.6677367Z Autotune Choices Stats: 2025-12-04T09:41:43.6678261Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6678356Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6678442Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6678546Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6679022Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6679545Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6680020Z 
triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6680537Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6681005Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6681510Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6681983Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6682455Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6682938Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6683414Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6683744Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.6683918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6684010Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6684141Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6684392Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6685392Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6685520Z graph_break [] 2025-12-04T09:41:43.6685622Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6685794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6685887Z Autotune Choices Stats: 2025-12-04T09:41:43.6686724Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6686818Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6686902Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6687008Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6687491Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6687970Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6688445Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6688912Z triton_mm_2776 0.0285 ms 97.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6689376Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6689899Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6690365Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6690879Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6691345Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6691814Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6692149Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:43.6692243Z Autotune Choices Stats: 2025-12-04T09:41:43.6693075Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6693166Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.6693257Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6693359Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6693834Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6694349Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6694857Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6695329Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6695796Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6696263Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6696732Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6697204Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6697706Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.6698202Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6698538Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:43.6698669Z Autotune Choices Stats: 2025-12-04T09:41:43.6699495Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.6699590Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6699674Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6699826Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6700583Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6701056Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6701530Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6702002Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6702481Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.6702955Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6703427Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6704036Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6704560Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6705032Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6705361Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:43.6705538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6705631Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6705759Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6706017Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6707392Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6707479Z graph_break [] 2025-12-04T09:41:43.6707584Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6707784Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6707896Z Autotune Choices Stats: 2025-12-04T09:41:43.6708736Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6708888Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6708975Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6709078Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6709609Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6710084Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6710557Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6711037Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6711508Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6711988Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6712458Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6712924Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6713432Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6713937Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6714271Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.6714361Z Autotune Choices Stats: 2025-12-04T09:41:43.6715207Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6715299Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6715391Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6715493Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6715973Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6716447Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6716913Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6717383Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6717851Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6718360Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6718890Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6719364Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6719889Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6720366Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6720699Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:43.6720922Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 
2025-12-04T09:41:43.6721024Z Traceback (most recent call last): 2025-12-04T09:41:43.6721442Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.6721623Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.6721972Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.6722151Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.6722316Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.6722402Z Searched string: 2025-12-04T09:41:43.6722576Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.6722582Z 2025-12-04T09:41:43.6722698Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.6722748Z 2025-12-04T09:41:43.6722877Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.6722999Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.6723004Z 2025-12-04T09:41:43.6723104Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.6723191Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.6723282Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6723375Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.6723380Z 2025-12-04T09:41:43.6723467Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.6723555Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.6723646Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6723733Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.6723739Z 2025-12-04T09:41:43.6723743Z 2025-12-04T09:41:43.6723907Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.6723912Z 2025-12-04T09:41:43.6723916Z 2025-12-04T09:41:43.6724037Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.6724150Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.6724265Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.6724348Z idx_m = 
rm[:, None] 2025-12-04T09:41:43.6724436Z idx_n = rn[None, :] 2025-12-04T09:41:43.6724527Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.6724532Z 2025-12-04T09:41:43.6724626Z # inductor generates a suffix 2025-12-04T09:41:43.6724718Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.6724928Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.6725012Z ''', device_str='cuda') 2025-12-04T09:41:43.6725017Z 2025-12-04T09:41:43.6725063Z 2025-12-04T09:41:43.6725167Z async_compile.wait(globals()) 2025-12-04T09:41:43.6725251Z del async_compile 2025-12-04T09:41:43.6725258Z 2025-12-04T09:41:43.6725342Z class Runner: 2025-12-04T09:41:43.6725444Z def __init__(self, partitions): 2025-12-04T09:41:43.6725546Z self.partitions = partitions 2025-12-04T09:41:43.6725551Z 2025-12-04T09:41:43.6725659Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.6725746Z new_callables = [] 2025-12-04T09:41:43.6725860Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.6726007Z new_callables.append(fn(c)) 2025-12-04T09:41:43.6726109Z self.partitions = new_callables 2025-12-04T09:41:43.6726113Z 2025-12-04T09:41:43.6726202Z def call(self, args): 2025-12-04T09:41:43.6726288Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.6726369Z args.clear() 2025-12-04T09:41:43.6726496Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.6726619Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.6726724Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.6726825Z torch.cuda.set_device(0) 2025-12-04T09:41:43.6726991Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.6727211Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.6727308Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.6727498Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.6727602Z del arg0_1 2025-12-04T09:41:43.6727786Z buf1 = 
empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.6728040Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.6728137Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.6728352Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.6728435Z del arg1_1 2025-12-04T09:41:43.6728515Z del buf0 2025-12-04T09:41:43.6728639Z return (buf1, ) 2025-12-04T09:41:43.6728644Z 2025-12-04T09:41:43.6728745Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.6728866Z call = runner.call 2025-12-04T09:41:43.6729024Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.6729028Z 2025-12-04T09:41:43.6729032Z 2025-12-04T09:41:43.6729170Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.6729302Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.6729447Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.6729653Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.6729849Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.6729948Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.6730112Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.6730117Z 2025-12-04T09:41:43.6730121Z 2025-12-04T09:41:43.6730209Z if __name__ == "__main__": 2025-12-04T09:41:43.6730410Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.6730568Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.6730649Z From CHECK: .to( 2025-12-04T09:41:43.6730656Z 2025-12-04T09:41:43.6730660Z 2025-12-04T09:41:43.6730836Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.6731384Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py 
TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.6731389Z 2025-12-04T09:41:43.6731607Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.6731778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6731923Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6732051Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6732301Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6733720Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6733805Z graph_break [] 2025-12-04T09:41:43.6733910Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6734085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6734175Z Autotune Choices Stats: 2025-12-04T09:41:43.6735023Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6735118Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6735203Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6735309Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6735794Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6736255Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6736719Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6737234Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6737743Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6738257Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6738715Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6739171Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6739635Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6740102Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6740439Z SingleProcess AUTOTUNE benchmarking takes 0.2016 
seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.6740531Z Autotune Choices Stats: 2025-12-04T09:41:43.6741377Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6741513Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6741600Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6741703Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6742178Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6742674Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6743133Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6743588Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6744049Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6744506Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6744963Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6745429Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6745893Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6746400Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6746766Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.6746939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6747038Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6747168Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6747415Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6748360Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6748444Z graph_break [] 2025-12-04T09:41:43.6748551Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6748723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6748818Z Autotune Choices Stats: 2025-12-04T09:41:43.6749644Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6749735Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6749822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6749924Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6750399Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6750935Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6751407Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6751923Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6752399Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6752884Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6753351Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6753823Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.6754292Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6754754Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6755085Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.6755178Z Autotune Choices Stats: 2025-12-04T09:41:43.6756041Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6756169Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6756254Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6756360Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6756825Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6757296Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6757829Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6758291Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6758768Z 
triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6759229Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6759753Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6760266Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6760745Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6761514Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6761849Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.6761947Z Autotune Choices Stats: 2025-12-04T09:41:43.6762792Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.6762898Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6762987Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6763097Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6763587Z triton_mm_858 0.0267 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6764049Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6764512Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6765024Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6765497Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6766016Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6766482Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6766959Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6767432Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6767941Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6768278Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.6768454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6768554Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6768683Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6768934Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6769880Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6770005Z graph_break [] 2025-12-04T09:41:43.6770110Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6770283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6770377Z Autotune Choices Stats: 2025-12-04T09:41:43.6771251Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.6771345Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6771433Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6771541Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6772023Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6772491Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6772955Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6773422Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6773889Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6774402Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6774907Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6775385Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6775850Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6776313Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6776650Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.6776742Z Autotune Choices Stats: 2025-12-04T09:41:43.6777570Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.6777663Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6777749Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6777856Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6778325Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6778795Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6779302Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6779802Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6780267Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6780729Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6781198Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6781666Z triton_mm_917 0.0328 ms 83.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6782141Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6782614Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6782941Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.6783038Z Autotune Choices Stats: 2025-12-04T09:41:43.6783919Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6784080Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6784168Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6784276Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6784758Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6785233Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6785713Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6786181Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6786646Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6787122Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6787631Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6788098Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6788608Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6789121Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6789448Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.6789621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6789716Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6789844Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6790092Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6791477Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6791565Z graph_break [] 2025-12-04T09:41:43.6791673Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6791845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6791938Z Autotune Choices Stats: 2025-12-04T09:41:43.6792785Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6792918Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6793008Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6793149Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6793630Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6794095Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6794560Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6795030Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.6795498Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6795969Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6796442Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6796912Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6797387Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6797946Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6798285Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.6798380Z Autotune Choices Stats: 2025-12-04T09:41:43.6799247Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6799343Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6799428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6799587Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6800066Z triton_mm_996 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6800707Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6801188Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6801648Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6802110Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6802642Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6803180Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6803656Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6804129Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6804601Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6804937Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.6805117Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6805213Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6805342Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6805591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6806532Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6806619Z graph_break [] 2025-12-04T09:41:43.6806723Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6806953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6807052Z Autotune Choices Stats: 2025-12-04T09:41:43.6807898Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.6808053Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6808139Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6808242Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6808724Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6809195Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6809679Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6810149Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6810617Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6811087Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6811592Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6812074Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6812585Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6813060Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6813389Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.6813480Z Autotune Choices Stats: 
2025-12-04T09:41:43.6814316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6814414Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6814504Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6814606Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6815078Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6815552Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6816025Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6816541Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6817010Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6817512Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6817993Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.6818500Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6818980Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6819457Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6819791Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.6819881Z Autotune Choices Stats: 2025-12-04T09:41:43.6820733Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6820831Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6820981Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6821098Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6821575Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6822094Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6822559Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6823023Z 
triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6823502Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6823976Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6824448Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6824919Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6825389Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6825904Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6826232Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.6826408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6826540Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6826669Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6826918Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6827903Z inductor 
[('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6827996Z graph_break [] 2025-12-04T09:41:43.6828101Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6828275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6828371Z Autotune Choices Stats: 2025-12-04T09:41:43.6829201Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6829297Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6829382Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6829485Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6829959Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6830477Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6830990Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6831482Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6831957Z triton_mm_1103 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6832439Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6832914Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6833394Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6833862Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6834340Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6834672Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.6834805Z Autotune Choices Stats: 2025-12-04T09:41:43.6835635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.6835728Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6835850Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6835959Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6836434Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6836914Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6837394Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6837875Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6838348Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6838818Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6839286Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6839852Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6840369Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6840841Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.6841173Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.6841264Z Autotune Choices Stats: 2025-12-04T09:41:43.6842091Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6842193Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6842281Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6842390Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6842869Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6843341Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6843814Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6844336Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6844816Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6845387Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.6845871Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6846348Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6846825Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6847296Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6847659Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.6847851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6847943Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6848074Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6848323Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6849745Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6849870Z graph_break [] 2025-12-04T09:41:43.6849973Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6850149Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.6850243Z Autotune Choices Stats: 2025-12-04T09:41:43.6851076Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6851179Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6851265Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6851373Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6851850Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6852326Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6852801Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6853282Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6853803Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6854293Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6854836Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6855326Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.6855804Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6856282Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6856617Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.6856706Z Autotune Choices Stats: 2025-12-04T09:41:43.6857571Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6857682Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6857775Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6857881Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6858405Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6858890Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6859400Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6859871Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6860336Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6860804Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6861273Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6861748Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6862222Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6862691Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6863065Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.6863238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6863334Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6863469Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6863716Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6865142Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6865228Z graph_break [] 2025-12-04T09:41:43.6865333Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6865512Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6865602Z Autotune Choices Stats: 2025-12-04T09:41:43.6866450Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6866546Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6866632Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6866739Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6867222Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6867735Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6868213Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6868724Z 
triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6869192Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6869657Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6870136Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6870614Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6871091Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6871565Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6871894Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.6871988Z Autotune Choices Stats: 2025-12-04T09:41:43.6872870Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6872965Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.6873051Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6873155Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6873678Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6874154Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6874626Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6875105Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6875573Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6876042Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6876509Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6877025Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6877498Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6878066Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6878393Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.6878565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6878660Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6878790Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6879039Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6880464Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6880550Z graph_break [] 2025-12-04T09:41:43.6880658Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6880834Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6880924Z Autotune Choices Stats: 2025-12-04T09:41:43.6881763Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.6881899Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6881989Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6882097Z 
dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6882572Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6883089Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6883561Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6884048Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6884523Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6885012Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6885490Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6885956Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6886467Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6886939Z triton_mm_1270 0.0287 ms 93.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6887311Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.6887405Z Autotune Choices Stats: 2025-12-04T09:41:43.6888295Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6888393Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6888482Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6888591Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6889076Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6889549Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6890028Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6890500Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6890967Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6891501Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6891972Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6892480Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6892952Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6893432Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6893761Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.6893938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6894030Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6894159Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6894409Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6895774Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6895901Z graph_break [] 
2025-12-04T09:41:43.6896005Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6896218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6899614Z Autotune Choices Stats: 2025-12-04T09:41:43.6900652Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6900752Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6900838Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6900942Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6901440Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6901929Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6902409Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6902894Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6903378Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6903862Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:43.6904442Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6904981Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6905455Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6905924Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6906255Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.6906350Z Autotune Choices Stats: 2025-12-04T09:41:43.6907189Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.6907284Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6907373Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6907485Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6907998Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6908471Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.6909003Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6909544Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6910017Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6910488Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6910960Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6911441Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6911919Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6912395Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6912728Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.6912903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6912997Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.6913174Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6913421Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6914844Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6914930Z graph_break [] 2025-12-04T09:41:43.6915033Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6915208Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6915298Z Autotune Choices Stats: 2025-12-04T09:41:43.6916146Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6916245Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6916328Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6916435Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6916919Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6917394Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6917888Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6918436Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6918954Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6919426Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6919965Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6920439Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6920920Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6921398Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6921732Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.6921826Z Autotune Choices Stats: 2025-12-04T09:41:43.6922666Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6922809Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6922895Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6923000Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6923483Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6923998Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6924477Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6924945Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6925415Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6925888Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6926361Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6926838Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6927362Z 
triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6927880Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6928271Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.6928444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6928542Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6928672Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6928923Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6929867Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6929954Z graph_break [] 2025-12-04T09:41:43.6930061Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6930233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6930328Z Autotune Choices Stats: 2025-12-04T09:41:43.6931164Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.6931255Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6931342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6931444Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.6931918Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6932445Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6932916Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6933424Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6933899Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6934373Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6934853Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6935335Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6935820Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6936290Z triton_mm_1402 0.0278 ms 99.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6936622Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.6936715Z Autotune Choices Stats: 2025-12-04T09:41:43.6937650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6937779Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6937865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6937970Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6938446Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6938921Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6939402Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6939874Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6940348Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6940815Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6941284Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6941820Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6942307Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6942821Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6943153Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.6943251Z Autotune Choices Stats: 2025-12-04T09:41:43.6944097Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6944192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6944282Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6944391Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6944883Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6945358Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6945836Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6946360Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6946839Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6947361Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6947893Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6948376Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6948863Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6949348Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6949676Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.6949849Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.6949946Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6950074Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6950322Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6951273Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6951398Z graph_break [] 2025-12-04T09:41:43.6951502Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6951672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6951801Z Autotune Choices Stats: 2025-12-04T09:41:43.6952639Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6952731Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6952819Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6952925Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6953407Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6953893Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6954369Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6954854Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6955325Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6955841Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6956348Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6956819Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6957338Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6957804Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6958142Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.6958235Z Autotune Choices Stats: 2025-12-04T09:41:43.6959086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6959182Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6959266Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6959371Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6959885Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6960406Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6960882Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6961391Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6961863Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6962329Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6962805Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6963281Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6963756Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6964233Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6964564Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.6964660Z Autotune Choices Stats: 2025-12-04T09:41:43.6965557Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6965690Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6965774Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6965885Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6966363Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6966837Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6967315Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6967850Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6968333Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6968818Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6969301Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6969824Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6970297Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6970808Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6971137Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.6971308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6971405Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6971534Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6971786Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6972732Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.6972817Z graph_break [] 2025-12-04T09:41:43.6972923Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6973095Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6973184Z Autotune Choices Stats: 2025-12-04T09:41:43.6974027Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6974123Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6974250Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6974356Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6974873Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6975352Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6975827Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6976311Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6976798Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6977275Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6977798Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6978270Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6978740Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6979253Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6979586Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.6979677Z Autotune Choices Stats: 2025-12-04T09:41:43.6980542Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.6980638Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6980723Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6980830Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.6981311Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6981784Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6982261Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6982728Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6983200Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6983707Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6984216Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6984692Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6985166Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6985643Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6985973Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.6986066Z Autotune Choices Stats: 2025-12-04T09:41:43.6986908Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.6987000Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6987090Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6987198Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.6987733Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6988209Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6988729Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6989241Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6989710Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6990181Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6990660Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6991140Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6991620Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.6992098Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6992426Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.6992598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.6992697Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.6992872Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.6993121Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.6994548Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.6994631Z graph_break [] 2025-12-04T09:41:43.6994738Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.6994911Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.6995004Z Autotune Choices Stats: 2025-12-04T09:41:43.6995847Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.6995941Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.6996028Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.6996133Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.6996609Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.6997091Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.6997717Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6998194Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.6998729Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.6999198Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.6999780Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7000417Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7000897Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7001378Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7001710Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.7001800Z Autotune Choices Stats: 2025-12-04T09:41:43.7002711Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7002811Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7002895Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7003053Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7003532Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7004004Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7004475Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7004946Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7005419Z 
triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7005892Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7006361Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7006838Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7007416Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7007896Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7008227Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.7008454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7008548Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7008679Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7008927Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7010306Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7010398Z graph_break [] 2025-12-04T09:41:43.7010500Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7010673Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7010765Z Autotune Choices Stats: 2025-12-04T09:41:43.7011607Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.7011701Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7011789Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7011892Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7012413Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7012924Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7013396Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7013867Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7014336Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7014825Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7015302Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7015777Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7016254Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7016775Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7017108Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.7017201Z Autotune Choices Stats: 2025-12-04T09:41:43.7018074Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7018168Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7018254Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7018357Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7018831Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7019307Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7019779Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7020260Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7020729Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7021197Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7021708Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7022230Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7022709Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7023183Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7023513Z SingleProcess AUTOTUNE benchmarking 
takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.7023690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7023783Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7023918Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7024164Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7025550Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7025632Z graph_break [] 2025-12-04T09:41:43.7025733Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7025951Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7026043Z Autotune Choices Stats: 2025-12-04T09:41:43.7026885Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7027017Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7027102Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7027211Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7027743Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7028224Z triton_mm_1748 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7028706Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7029188Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7029674Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7030148Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7030664Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7031150Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7031661Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7032127Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7032456Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.7032550Z 
Autotune Choices Stats: 2025-12-04T09:41:43.7033395Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7033494Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7033578Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7033681Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7034173Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7034642Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7035114Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7035651Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7036120Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7036629Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7037106Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.7037629Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7038111Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7038590Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7038924Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.7039096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7039190Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7039325Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7039626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7041052Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7041171Z graph_break [] 2025-12-04T09:41:43.7041281Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7041455Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7041549Z Autotune Choices Stats: 2025-12-04T09:41:43.7042382Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7042476Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7042568Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7042671Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7043153Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7043639Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7044116Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7044600Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7045121Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7045611Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7046118Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7046599Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7047108Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7047597Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7047940Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.7048032Z Autotune Choices Stats: 2025-12-04T09:41:43.7048875Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7048967Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7049052Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7049160Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7049681Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7050161Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7050675Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7051150Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7051624Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7052099Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7052575Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7053052Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7053532Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7054005Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7054379Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.7054554Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7054650Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7054782Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7055066Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7056016Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7056103Z graph_break [] 2025-12-04T09:41:43.7056206Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7056381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7056476Z Autotune Choices Stats: 2025-12-04T09:41:43.7057337Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.7057443Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7057528Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7057656Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7058172Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7058643Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7059158Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7059675Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7060152Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7060634Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7061104Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7061586Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7062059Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7062540Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7062870Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.7062960Z Autotune Choices Stats: 2025-12-04T09:41:43.7063793Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7063928Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7064020Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7064122Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7064643Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.7065135Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7065605Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7066082Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7066554Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7067030Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7067549Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7068026Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7068547Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7069083Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7069422Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds 
and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.7069514Z Autotune Choices Stats: 2025-12-04T09:41:43.7070363Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7070461Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7070545Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7070660Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7071136Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7071612Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7072090Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7072562Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7073095Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7073574Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7074101Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7074589Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7075062Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7075540Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7075875Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.7076051Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7076149Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7076278Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7076523Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7077941Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7078036Z graph_break [] 2025-12-04T09:41:43.7078178Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7078355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7078446Z Autotune Choices Stats: 
2025-12-04T09:41:43.7079294Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.7079390Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7079521Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7079631Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7080134Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7080614Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7081095Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7081571Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7082054Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7082575Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7083048Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7083560Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7084037Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7084515Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7084852Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.7084943Z Autotune Choices Stats: 2025-12-04T09:41:43.7085802Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7085895Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7085983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7086087Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7086567Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7087086Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7087572Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7088134Z 
triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7088601Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7089074Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7089547Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7090021Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7090504Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7090980Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7091314Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.7091487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7091638Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7091775Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7092023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7093444Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7093527Z graph_break [] 2025-12-04T09:41:43.7093631Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7093808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7093902Z Autotune Choices Stats: 2025-12-04T09:41:43.7094744Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7094839Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7094924Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7095033Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7095513Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7096002Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7096512Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7096991Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7097515Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7098039Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7098513Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7098986Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7099461Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7099931Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7100396Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.7100493Z Autotune Choices Stats: 2025-12-04T09:41:43.7101334Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7101498Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7101583Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7101693Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7102179Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7102702Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7103177Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7103649Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7104126Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7104600Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7105068Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7105548Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7106105Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7106585Z triton_mm_1992 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7106969Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.7107145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7107248Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7107377Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7107627Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7109009Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7109100Z graph_break [] 2025-12-04T09:41:43.7109204Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7109379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7109472Z Autotune Choices Stats: 2025-12-04T09:41:43.7110322Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7110414Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7110544Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7110646Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7111128Z triton_mm_1997 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7111613Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7112125Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7112610Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7113097Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7113573Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7114048Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7114525Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7114999Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7115509Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7115844Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.7115975Z Autotune Choices Stats: 2025-12-04T09:41:43.7116818Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7116910Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7116993Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7117100Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7117592Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7118115Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7118585Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7119061Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7119581Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7120050Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.7120570Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7121047Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7121562Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7122034Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7122363Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.7122550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7122645Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7122782Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7123025Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7124405Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7124491Z graph_break [] 2025-12-04T09:41:43.7124593Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7124772Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7124910Z Autotune Choices Stats: 2025-12-04T09:41:43.7125758Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7125897Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7125982Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7126090Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7126563Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7127030Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7127508Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7128021Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7128503Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7128976Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7129454Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7129973Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7130446Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7130974Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7131303Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.7131399Z Autotune Choices Stats: 2025-12-04T09:41:43.7132252Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7132349Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7132437Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7132538Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7133021Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7133504Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7133975Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7134484Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7134989Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7135460Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7135925Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7136399Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7136879Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7137354Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7137689Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.7137862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7137957Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7138085Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7138328Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7139317Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7139403Z graph_break [] 2025-12-04T09:41:43.7139508Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7139681Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7139834Z Autotune Choices Stats: 2025-12-04T09:41:43.7140667Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7140759Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7140846Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7140954Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7141433Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7141915Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7142392Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7142871Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7143390Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7143887Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7144456Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7144924Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7145399Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7145869Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7146205Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.7146299Z Autotune Choices Stats: 2025-12-04T09:41:43.7147129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7147224Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7147311Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7147414Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.7147939Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7148455Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7148929Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7149441Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7149913Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7150380Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7150853Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7151336Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7151811Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7152287Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7152615Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.7152712Z Autotune Choices Stats: 2025-12-04T09:41:43.7153578Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7153707Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7153799Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7153911Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7154388Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7154865Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7155343Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7155820Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7156300Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7156779Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7157257Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7157816Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7158320Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7158833Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7159169Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.7159342Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7159438Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7159626Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7159874Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7161262Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7161348Z graph_break [] 2025-12-04T09:41:43.7161455Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7161626Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7161716Z Autotune Choices Stats: 2025-12-04T09:41:43.7162610Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7162744Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7162837Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7162941Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7163427Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7163904Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7164373Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7164864Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7165340Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7165815Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7166301Z 
triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7166767Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7167300Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7167769Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7168139Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.7168231Z Autotune Choices Stats: 2025-12-04T09:41:43.7169071Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7169170Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7169256Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7169361Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7169844Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7170331Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7170809Z triton_mm_2202 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7171281Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7171793Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7172261Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7172765Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7173238Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7173709Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7174191Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7174519Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.7174698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7174792Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7174927Z stats [('calls_captured', 2), ('unique_graphs', 
1)] 2025-12-04T09:41:43.7175176Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7176115Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7176265Z graph_break [] 2025-12-04T09:41:43.7176371Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7176544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7176639Z Autotune Choices Stats: 2025-12-04T09:41:43.7177510Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7177603Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7177693Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7181156Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7181676Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7182168Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7182643Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7183121Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7183601Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7184080Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7184627Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7185145Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7185624Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7186091Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7186424Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.7186518Z Autotune Choices Stats: 2025-12-04T09:41:43.7187358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7187453Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7187537Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.7187645Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7188169Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7188639Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7189150Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7189631Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7190138Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7190604Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7191072Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7191546Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7192023Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7192497Z triton_mm_2249 
0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7192827Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.7192920Z Autotune Choices Stats: 2025-12-04T09:41:43.7193776Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.7193876Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7193999Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7194110Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7194585Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7195051Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7195526Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7195996Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7196462Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7196931Z triton_mm_2273 0.0276 ms 99.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7197396Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7197867Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7198380Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7198853Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7199217Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.7199394Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7199556Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7199688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7199945Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7201593Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7201685Z 
graph_break [] 2025-12-04T09:41:43.7201787Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7201962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7202054Z Autotune Choices Stats: 2025-12-04T09:41:43.7202885Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7202980Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7203068Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7203248Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7203729Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7204267Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7204741Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7205212Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7205681Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7206150Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7206621Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7207090Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7207591Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7208137Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7208471Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.7208561Z Autotune Choices Stats: 2025-12-04T09:41:43.7209455Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7209549Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7209633Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7209739Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7210215Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7210699Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.7211168Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7211635Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7212103Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7212606Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7213086Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7213626Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7214100Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7214569Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7214901Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.7215078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7215174Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.7215306Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7215554Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7216920Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7217004Z graph_break [] 2025-12-04T09:41:43.7217148Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7217326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7217415Z Autotune Choices Stats: 2025-12-04T09:41:43.7218349Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7218448Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7218532Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7218636Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7219110Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7219592Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7220076Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7220546Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7221022Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7221487Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7222002Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7222472Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7222982Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7223465Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7223794Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.7223886Z Autotune Choices Stats: 2025-12-04T09:41:43.7224729Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7224824Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7224911Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7225013Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7225495Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7225961Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7226427Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7226945Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7227413Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7227917Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7228392Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7228876Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7229349Z 
triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7229828Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7230161Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.7230333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7230432Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7230563Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7230807Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7232226Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7232347Z graph_break [] 2025-12-04T09:41:43.7232451Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7232623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7232713Z Autotune Choices Stats: 2025-12-04T09:41:43.7233547Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 
2025-12-04T09:41:43.7233643Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7233731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7233834Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7234312Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7234787Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7235253Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7235726Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7236240Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7236713Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7237230Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7237735Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7238235Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7238708Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7239044Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.7239135Z Autotune Choices Stats: 2025-12-04T09:41:43.7240024Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7240119Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7240203Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7240310Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7240836Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7241350Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7241822Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7242288Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7242758Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7243227Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7243699Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7244174Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7244646Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7245120Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7245492Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.7245669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7245762Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7245893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7246178Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7247130Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7247233Z graph_break [] 
2025-12-04T09:41:43.7247347Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7247534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7247626Z Autotune Choices Stats: 2025-12-04T09:41:43.7248456Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7248557Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7248643Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7248746Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7249224Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7249699Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7250239Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7250759Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7251237Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7251712Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.7252184Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7252656Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7253127Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7253596Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7253927Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:43.7254017Z Autotune Choices Stats: 2025-12-04T09:41:43.7254851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7254985Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7255072Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7255174Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7255687Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7256165Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.7256631Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7257108Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7257579Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7258045Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7258513Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7258984Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7259499Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7260006Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7260339Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.7260430Z Autotune Choices Stats: 2025-12-04T09:41:43.7261264Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7261361Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7261449Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7261558Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7262034Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7262506Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7262975Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7263449Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7263963Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7264432Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7264944Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7265419Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7265891Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7266371Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7266699Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.7266874Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7266968Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7267100Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7267355Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7268345Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7268470Z graph_break [] 2025-12-04T09:41:43.7268574Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7268746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7268878Z Autotune Choices Stats: 2025-12-04T09:41:43.7269718Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.7269810Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.7269899Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7270001Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7270485Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7270960Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7271431Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7271909Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7272387Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7272860Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7273369Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7273841Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7274345Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.7274816Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7275148Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.7275242Z Autotune Choices Stats: 2025-12-04T09:41:43.7276073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7276168Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7276254Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7276358Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7276830Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7277349Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7277879Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7278385Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7278860Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:43.7279325Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7279852Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7280332Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7280809Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7281283Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7281611Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.7281704Z Autotune Choices Stats: 2025-12-04T09:41:43.7282534Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7282671Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7282758Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7282867Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7283407Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7283883Z 
triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7284359Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7284843Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7285332Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7285811Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7286283Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7286762Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7287276Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7287784Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7288116Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 
2025-12-04T09:41:43.7288289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7288384Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7288514Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7288762Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7290155Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7290245Z graph_break [] 2025-12-04T09:41:43.7290348Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7290522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7290615Z Autotune Choices Stats: 2025-12-04T09:41:43.7291447Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.7291582Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7291671Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7291775Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7292254Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7292765Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7293241Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7293718Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7294198Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7294681Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7295153Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7295622Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7296088Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7296652Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7297029Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.7297120Z Autotune Choices Stats: 2025-12-04T09:41:43.7298022Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7298113Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7298197Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7298302Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7298779Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7299255Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7299734Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7300204Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7300819Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7301364Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7301835Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7302367Z triton_mm_2638 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7302846Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7303319Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7303651Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.7303826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7303921Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7304057Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7304303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7305253Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7305337Z graph_break [] 2025-12-04T09:41:43.7305440Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7305614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7305707Z Autotune Choices Stats: 2025-12-04T09:41:43.7306611Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.7306755Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7306842Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7306945Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7307470Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7307961Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7308453Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7308927Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7309401Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7309869Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7310337Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7310852Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7311322Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7311836Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7312169Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.7312263Z Autotune Choices Stats: 2025-12-04T09:41:43.7313100Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.7313194Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7313284Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7313387Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7313868Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7314344Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7314810Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7315333Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7315807Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7316316Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7316781Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7317257Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7317784Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7318260Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7318591Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.7318682Z Autotune Choices Stats: 2025-12-04T09:41:43.7319572Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7319664Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7319817Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7319929Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7320405Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.7320887Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7321397Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7321865Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7322341Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7322821Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7323300Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7323773Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7324243Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7324752Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7325081Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds 
and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.7325296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7325389Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7325531Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7325776Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7327155Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7327242Z graph_break [] 2025-12-04T09:41:43.7327347Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7327526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7327622Z Autotune Choices Stats: 2025-12-04T09:41:43.7328526Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7328621Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7328706Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7328817Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7329302Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7329813Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7330329Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7330805Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7331283Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7331771Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7332245Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7332726Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7333192Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7333662Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7333995Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.7334126Z Autotune Choices Stats: 
2025-12-04T09:41:43.7334965Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7335099Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7335189Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7335291Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7335776Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7336249Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7336727Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7337201Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7337716Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7338183Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7338648Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7339163Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7339675Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7340146Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7340477Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.7340649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7340747Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7340881Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7341127Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7342077Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7342158Z graph_break [] 2025-12-04T09:41:43.7342263Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7342438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7342528Z Autotune Choices Stats: 2025-12-04T09:41:43.7343407Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7343504Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7343632Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7343739Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7344228Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7344710Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7345178Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7345651Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7346124Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7346601Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7347075Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7347546Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7348057Z triton_mm_2774 0.0287 ms 
96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7348525Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7348897Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:43.7348994Z Autotune Choices Stats: 2025-12-04T09:41:43.7349822Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7349915Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7350003Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7350107Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7350585Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7351061Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7351538Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7352005Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7352513Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7352988Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7353522Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7354003Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7354473Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7354951Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7355280Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:43.7355373Z Autotune Choices Stats: 2025-12-04T09:41:43.7356208Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.7356301Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7356390Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7356499Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7356978Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7357493Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7358009Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7358523Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7358996Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7359526Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7360004Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7360480Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7360954Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7361419Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.7361756Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:43.7361974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7362068Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7362239Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7362485Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7363864Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7363948Z graph_break [] 2025-12-04T09:41:43.7364053Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7364228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7364319Z Autotune Choices Stats: 2025-12-04T09:41:43.7365163Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7365260Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7365345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7365451Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7365931Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7366410Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7366938Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7367465Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7367991Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7368468Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7368945Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7369426Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7369902Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7370369Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7370707Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.7370800Z Autotune Choices Stats: 2025-12-04T09:41:43.7371682Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7371818Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7371903Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7372005Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7372488Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7372958Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7373429Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7373905Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7374380Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7374850Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7375317Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7375795Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7376313Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7376791Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7377156Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:43.7377334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7377429Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7377584Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7377859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7378806Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7378891Z graph_break [] 2025-12-04T09:41:43.7378995Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7379168Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7379264Z Autotune Choices Stats: 2025-12-04T09:41:43.7380107Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7380199Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7380290Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7380396Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7380919Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7381426Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7381902Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7382384Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7382861Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7383345Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7383814Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7384293Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.7384764Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7385237Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7385612Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:43.7385705Z Autotune Choices Stats: 2025-12-04T09:41:43.7386580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7386676Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7386761Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7386870Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7387345Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7387875Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7388352Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7388823Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.7389296Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7389763Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7390302Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7390815Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7391295Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7391769Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7392101Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:43.7392197Z Autotune Choices Stats: 2025-12-04T09:41:43.7393040Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.7393136Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7393222Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7393332Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7393814Z 
triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7394284Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7394800Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7395277Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7395797Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7396279Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7396755Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7397243Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7397758Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7398229Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7398558Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:43.7398730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7398827Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7398957Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7399249Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7400711Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7400875Z graph_break [] 2025-12-04T09:41:43.7400992Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7401171Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7401260Z Autotune Choices Stats: 2025-12-04T09:41:43.7402107Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7402207Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7402295Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7402402Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7402885Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7403364Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7403836Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7404307Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7404846Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7405401Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7406004Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7406555Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7407107Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7407667Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7408057Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:43.7408151Z Autotune Choices Stats: 
2025-12-04T09:41:43.7409152Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7409247Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7409335Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7409448Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7410063Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7410542Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7411062Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7411528Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7411996Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7412469Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7412938Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7413413Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7413894Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7414366Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7414734Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:43.7414829Z Autotune Choices Stats: 2025-12-04T09:41:43.7415711Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:43.7415811Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7415898Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7416006Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7416492Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7416964Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7417434Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7417955Z 
triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7418424Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7418893Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7419406Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7419881Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7420393Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7420866Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7421194Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:43.7421368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7421467Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7421598Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7421845Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7423234Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7423317Z graph_break [] 2025-12-04T09:41:43.7423421Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7423592Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7423728Z Autotune Choices Stats: 2025-12-04T09:41:43.7424563Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7424656Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7424746Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7424914Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7425391Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7425869Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7426352Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7426839Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7427313Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7427817Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7428307Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7428815Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7429323Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7429798Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7430129Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:43.7430219Z Autotune Choices Stats: 2025-12-04T09:41:43.7431068Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:43.7431164Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7431249Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7431360Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7431843Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7432327Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7432800Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7433268Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7433781Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7434286Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7434767Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7435239Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7435719Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7436193Z triton_mm_3111 0.0358 ms 85.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7436524Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:43.7436751Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.7436853Z Traceback (most recent call last): 2025-12-04T09:41:43.7437270Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.7437455Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.7437802Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.7438029Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.7438193Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.7438315Z Searched string: 2025-12-04T09:41:43.7438453Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.7438459Z 2025-12-04T09:41:43.7438574Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.7438579Z 2025-12-04T09:41:43.7438711Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.7438834Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.7438838Z 2025-12-04T09:41:43.7438929Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.7439023Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.7439116Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.7439204Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.7439208Z 2025-12-04T09:41:43.7439298Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.7439386Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.7439530Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.7439617Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.7439625Z 2025-12-04T09:41:43.7439629Z 
2025-12-04T09:41:43.7439786Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.7439790Z 2025-12-04T09:41:43.7439794Z 2025-12-04T09:41:43.7439915Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.7440031Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.7440144Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.7440228Z idx_m = rm[:, None] 2025-12-04T09:41:43.7440310Z idx_n = rn[None, :] 2025-12-04T09:41:43.7440406Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.7440411Z 2025-12-04T09:41:43.7440509Z # inductor generates a suffix 2025-12-04T09:41:43.7440600Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.7440866Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.7440953Z ''', device_str='cuda') 2025-12-04T09:41:43.7440957Z 2025-12-04T09:41:43.7440961Z 2025-12-04T09:41:43.7441060Z async_compile.wait(globals()) 2025-12-04T09:41:43.7441146Z del async_compile 2025-12-04T09:41:43.7441150Z 2025-12-04T09:41:43.7441229Z class Runner: 2025-12-04T09:41:43.7441334Z def __init__(self, partitions): 2025-12-04T09:41:43.7441437Z self.partitions = partitions 2025-12-04T09:41:43.7441528Z 2025-12-04T09:41:43.7441640Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.7441731Z new_callables = [] 2025-12-04T09:41:43.7441846Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.7441948Z new_callables.append(fn(c)) 2025-12-04T09:41:43.7442054Z self.partitions = new_callables 2025-12-04T09:41:43.7442058Z 2025-12-04T09:41:43.7442145Z def call(self, args): 2025-12-04T09:41:43.7442233Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.7442317Z args.clear() 2025-12-04T09:41:43.7442447Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.7442573Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.7442681Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.7442775Z torch.cuda.set_device(0) 2025-12-04T09:41:43.7442943Z buf0 
= empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.7443165Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.7443263Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.7443451Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.7443532Z del arg0_1 2025-12-04T09:41:43.7443699Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.7443952Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.7444052Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.7444316Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.7444398Z del arg1_1 2025-12-04T09:41:43.7444513Z del buf0 2025-12-04T09:41:43.7444600Z return (buf1, ) 2025-12-04T09:41:43.7444604Z 2025-12-04T09:41:43.7444701Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.7444785Z call = runner.call 2025-12-04T09:41:43.7444943Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.7444948Z 2025-12-04T09:41:43.7444952Z 2025-12-04T09:41:43.7445089Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.7445222Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.7448727Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.7448956Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.7449162Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.7449263Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.7449430Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.7449438Z 2025-12-04T09:41:43.7449442Z 2025-12-04T09:41:43.7449531Z if __name__ == "__main__": 2025-12-04T09:41:43.7449736Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.7449900Z 
compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.7449985Z From CHECK: .to( 2025-12-04T09:41:43.7449989Z 2025-12-04T09:41:43.7449993Z 2025-12-04T09:41:43.7450168Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.7450712Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.7450717Z 2025-12-04T09:41:43.7450997Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.7451178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7451272Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7451405Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7451652Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7453076Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7453162Z graph_break [] 2025-12-04T09:41:43.7453265Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7453442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7453535Z Autotune Choices Stats: 2025-12-04T09:41:43.7454372Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7454476Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7454562Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7454668Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7455151Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7455611Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7456123Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7456623Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7457085Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7457580Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7458061Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7458523Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7458979Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7459445Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7459777Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.7459869Z Autotune Choices Stats: 2025-12-04T09:41:43.7460699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7460861Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7460948Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7461051Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7461564Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7462027Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7462480Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7462942Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7463399Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7463858Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7464312Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7464778Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7465283Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7465784Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7466118Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.7466291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7466388Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7466518Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7466763Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7467711Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7467797Z graph_break [] 2025-12-04T09:41:43.7467904Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7468074Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7468168Z Autotune Choices Stats: 2025-12-04T09:41:43.7468995Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7469087Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7469171Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7469320Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7469794Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7470268Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7470778Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7471255Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7471733Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7472217Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.7472683Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7473152Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7473623Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7474081Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7474455Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.7474585Z Autotune Choices Stats: 2025-12-04T09:41:43.7475428Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7475524Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7475609Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7475712Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7476180Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7476647Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7477118Z triton_mm_824 
0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7477583Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7478094Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7478553Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7479055Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7479597Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7480112Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7480586Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7480916Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.7481009Z Autotune Choices Stats: 2025-12-04T09:41:43.7481851Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.7481945Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7482032Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7482144Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7482624Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7483090Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7483590Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7484065Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7484580Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7485052Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7485521Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7485997Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.7486470Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7486933Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7487264Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.7487437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7487547Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7487698Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7488001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7488949Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7489033Z graph_break [] 2025-12-04T09:41:43.7489134Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7489348Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7489440Z Autotune Choices Stats: 2025-12-04T09:41:43.7490293Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.7490389Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7490474Z strides: [256, 1], 
[256, 1] 2025-12-04T09:41:43.7490581Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7491056Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7491529Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7491991Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7492454Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7492969Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7493701Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7494174Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7494648Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7495112Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7495578Z triton_mm_883 
0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7495915Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.7496009Z Autotune Choices Stats: 2025-12-04T09:41:43.7496843Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.7496938Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7497023Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7497126Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7497597Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7498109Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7498614Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7499081Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7499545Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7500015Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7500754Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7501233Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7501702Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7502177Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7502582Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.7502676Z Autotune Choices Stats: 2025-12-04T09:41:43.7503523Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7503671Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7503758Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7503868Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7504346Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7504828Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7505310Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7505778Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7506244Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7506705Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7507170Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7507735Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7508259Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7508727Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7509063Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.7509235Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.7509331Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7509466Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7509708Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7511100Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7511182Z graph_break [] 2025-12-04T09:41:43.7511284Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7511458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7511547Z Autotune Choices Stats: 2025-12-04T09:41:43.7512433Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7512562Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7512647Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7512753Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7513236Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7513711Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.7514175Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7514643Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7515110Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7515582Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7516054Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7516527Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7517047Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7517575Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7517933Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.7518026Z Autotune Choices Stats: 2025-12-04T09:41:43.7518859Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7518956Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7519042Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7519144Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7519677Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7520149Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7520626Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7521089Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7521596Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7522099Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7522569Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7523047Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7523522Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7524003Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7524335Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.7524509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7524606Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7524736Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7524980Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7525927Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7526059Z graph_break [] 2025-12-04T09:41:43.7526164Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7526338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7526429Z Autotune Choices Stats: 2025-12-04T09:41:43.7527303Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.7527399Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.7527494Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7527615Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7528123Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7528604Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7529076Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7529549Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7530017Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7530552Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7531024Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7531540Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7532016Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.7532489Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7532824Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.7532920Z Autotune Choices Stats: 2025-12-04T09:41:43.7533769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7533867Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7533953Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7534058Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7534532Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7535006Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7535525Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7536005Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7536513Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.7536982Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7537500Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7537977Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7538453Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7538932Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7539258Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.7539352Z Autotune Choices Stats: 2025-12-04T09:41:43.7540231Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7540367Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7540455Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7540565Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7541057Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7541541Z 
triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7542009Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7542493Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7542971Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7543454Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7543922Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7544397Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7544911Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7545388Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7545752Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 
2025-12-04T09:41:43.7545926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7546023Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7546152Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7546399Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7547352Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7547449Z graph_break [] 2025-12-04T09:41:43.7547566Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7547762Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7547853Z Autotune Choices Stats: 2025-12-04T09:41:43.7548690Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7548781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7548868Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7548974Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7549492Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7550007Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7550485Z triton_mm_1097 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7550973Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7551451Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7551937Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7552496Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7553060Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7553613Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7554169Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7554540Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.7554633Z Autotune Choices Stats: 2025-12-04T09:41:43.7555507Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7555602Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7555686Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7555790Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7556263Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7556745Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7557222Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7557698Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7558173Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7558641Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7559152Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7559708Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.7560186Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7560660Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7560988Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.7561083Z Autotune Choices Stats: 2025-12-04T09:41:43.7561907Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7562002Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7562093Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7562205Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7562683Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7563156Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7563671Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7564152Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.7564696Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7565187Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7565672Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7566151Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7566626Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7567096Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7567428Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.7567599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7567694Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7567823Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7568112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7569506Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), 
('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7569627Z graph_break [] 2025-12-04T09:41:43.7569731Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7569903Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7569993Z Autotune Choices Stats: 2025-12-04T09:41:43.7570831Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7570925Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7571015Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7571119Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7571598Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7572075Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7572546Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7573071Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7573552Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7574068Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7574550Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7575034Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7575527Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7576004Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7576339Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.7576430Z Autotune Choices Stats: 2025-12-04T09:41:43.7577264Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7577364Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7577448Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7577590Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7578075Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7578591Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7579063Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7579530Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7580004Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7580471Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7580946Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7581419Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7581893Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7582412Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.7582742Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.7582918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7583010Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7583179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7583428Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7584803Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7584890Z graph_break [] 2025-12-04T09:41:43.7584999Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7585170Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7585262Z Autotune Choices Stats: 2025-12-04T09:41:43.7586115Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7586213Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7586298Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7586402Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7586930Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.7587458Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7587975Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7588443Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7588911Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7589389Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7589864Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7590338Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7590810Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7591285Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7591725Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.7591817Z Autotune Choices Stats: 2025-12-04T09:41:43.7592701Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7592796Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7592884Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7592985Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7593468Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7593947Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7594423Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7594906Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7595375Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7595843Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7596355Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7596831Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7597348Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7597872Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7598208Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.7598381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7598479Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7598615Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7598860Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7600440Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7600527Z graph_break [] 2025-12-04T09:41:43.7600629Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7600803Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7600990Z Autotune Choices Stats: 2025-12-04T09:41:43.7601842Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.7601936Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7602021Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7602127Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7602654Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7603135Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7603607Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7604081Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7604562Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7605036Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7605515Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7606036Z 
triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7606508Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7607031Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7607409Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.7607504Z Autotune Choices Stats: 2025-12-04T09:41:43.7608343Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7608444Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7608529Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7608634Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7609112Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7609588Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7610064Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7610534Z triton_mm_1302 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7611042Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7611515Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7612019Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7612496Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7612966Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7613446Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7613777Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.7613951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7614047Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7614175Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7614421Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7615829Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7615951Z graph_break [] 2025-12-04T09:41:43.7616055Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7616226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7616324Z Autotune Choices Stats: 2025-12-04T09:41:43.7617157Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7617247Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7617337Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7617438Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7617964Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7618444Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7618916Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7619397Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7619878Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7620403Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7620926Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7621399Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7621873Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7622339Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7622676Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.7622770Z Autotune Choices Stats: 2025-12-04T09:41:43.7623603Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.7623695Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7623783Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7623888Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.7624363Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7624880Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7625395Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7625874Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7626344Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7626806Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7627279Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7627780Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7628281Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7628753Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7629082Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.7629299Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7629392Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7629529Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7629775Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7631197Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7631292Z graph_break [] 2025-12-04T09:41:43.7631395Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7631574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7631667Z Autotune Choices Stats: 2025-12-04T09:41:43.7632516Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7632614Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7632701Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7632805Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7633290Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7633761Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7634273Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7634801Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7635273Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7635739Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7636210Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7636689Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7637176Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7637682Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:43.7638011Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.7638106Z Autotune Choices Stats: 2025-12-04T09:41:43.7638947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7639082Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7639170Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7639273Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7639890Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7640371Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7640844Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7641316Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7641785Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7642265Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7642727Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7643197Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7643711Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7644220Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7644556Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.7644730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7644828Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7644957Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7645201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7646149Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7646234Z graph_break [] 2025-12-04T09:41:43.7646336Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7646514Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7646605Z Autotune Choices Stats: 2025-12-04T09:41:43.7647484Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.7647574Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7647659Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7647808Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7648286Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7648760Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7649265Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7649733Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7650206Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7650679Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7651155Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7651636Z triton_mm_1408 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7652114Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7652621Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7652956Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.7653092Z Autotune Choices Stats: 2025-12-04T09:41:43.7653935Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7654032Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7654117Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7654219Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7654707Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7655183Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7655655Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7656124Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7656591Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7657079Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7657613Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7658089Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7658597Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7659073Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7659402Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.7659496Z Autotune Choices Stats: 2025-12-04T09:41:43.7660340Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7660435Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7660526Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7660638Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7661118Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7661598Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7662114Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7662604Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7663122Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7663603Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7664085Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7664566Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7665057Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.7665539Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7665872Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.7666049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7666145Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7666324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7666574Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7667544Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7667636Z graph_break [] 2025-12-04T09:41:43.7667797Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7667973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7668063Z Autotune Choices Stats: 2025-12-04T09:41:43.7668892Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7668988Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7669073Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7669184Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7669668Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7670154Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7670629Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7671106Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7671643Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7672149Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7672622Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7673088Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7673559Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7674029Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7674363Z SingleProcess AUTOTUNE 
benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.7674456Z Autotune Choices Stats: 2025-12-04T09:41:43.7675286Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7675385Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7675471Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7675573Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7676095Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7676569Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7677079Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7677547Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7678011Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7678488Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7678955Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7679430Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7679945Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7680421Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7680792Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.7680885Z Autotune Choices Stats: 2025-12-04T09:41:43.7681778Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7681872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7681960Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7682069Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7682542Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7683020Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7683491Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7683977Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7684451Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7684928Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7685463Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7685942Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7686451Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7686918Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7687302Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.7687479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7687574Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7687709Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7687959Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7688905Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7688990Z graph_break [] 2025-12-04T09:41:43.7689094Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7689272Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7689361Z Autotune Choices Stats: 2025-12-04T09:41:43.7690238Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7690372Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7690457Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7690564Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7691046Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7691520Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7692009Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7692490Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7692981Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7693456Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7693929Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7694402Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7694945Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7695417Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7695788Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.7695884Z Autotune Choices Stats: 2025-12-04T09:41:43.7696716Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.7696816Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7696903Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7697005Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.7697480Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7697959Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7698481Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7698949Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7699460Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7699970Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7700588Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7701066Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7701541Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7702022Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7702355Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.7702447Z Autotune Choices Stats: 2025-12-04T09:41:43.7703298Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7703389Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7703479Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7703588Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7704073Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7704621Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7705172Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7705647Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7706114Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7706584Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7707062Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7707537Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7708016Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7708492Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7708886Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.7709061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7709208Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7709341Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7709587Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7710979Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7711064Z graph_break [] 2025-12-04T09:41:43.7711168Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7711347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7711437Z Autotune Choices Stats: 2025-12-04T09:41:43.7712277Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7712369Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7712453Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7712560Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7713034Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7713549Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7714034Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7714548Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7715025Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7715494Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7715976Z 
triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7716443Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7716923Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7717449Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7717778Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.7717873Z Autotune Choices Stats: 2025-12-04T09:41:43.7718751Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7718883Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7718968Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7719071Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7719616Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7720085Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7720562Z triton_mm_1686 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7721031Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7721502Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7721970Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7722434Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7722966Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7723438Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7723953Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7724284Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.7727841Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7727957Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7728100Z stats [('calls_captured', 2), ('unique_graphs', 
1)] 2025-12-04T09:41:43.7728381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7730110Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7730198Z graph_break [] 2025-12-04T09:41:43.7730308Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7730505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7730598Z Autotune Choices Stats: 2025-12-04T09:41:43.7731671Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.7731768Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7731856Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7732002Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7732489Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7732967Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7733436Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7733906Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7734374Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7734852Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7735321Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7735792Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7736308Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7736780Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7737119Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.7737266Z Autotune Choices Stats: 2025-12-04T09:41:43.7738097Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7738194Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7738280Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7738388Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7738862Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7739331Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7739800Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7740272Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7740740Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7741247Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7741799Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7742276Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7742747Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7743225Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7743556Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.7743734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7743826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7743955Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7744203Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7745564Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7745719Z graph_break [] 2025-12-04T09:41:43.7745821Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7745994Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7746091Z Autotune Choices Stats: 2025-12-04T09:41:43.7746968Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7747064Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.7747148Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7747251Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7747782Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7748259Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7748742Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7749221Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7749701Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7750210Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7750687Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7751197Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7751672Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.7752140Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7752471Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.7752566Z Autotune Choices Stats: 2025-12-04T09:41:43.7753409Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7753503Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7753591Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7753693Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7754169Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7754638Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7755147Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7755618Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7756120Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:43.7756600Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7757073Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7757552Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7758082Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7758555Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7758888Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.7759061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7759152Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7759287Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7759637Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7761023Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7761144Z graph_break [] 2025-12-04T09:41:43.7761246Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7761422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7761511Z Autotune Choices Stats: 2025-12-04T09:41:43.7762350Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7762446Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7762530Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7762640Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7763118Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7763601Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7764077Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7764596Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7765078Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7765597Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7766072Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7766548Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7767023Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7767492Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7767826Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.7767921Z Autotune Choices Stats: 2025-12-04T09:41:43.7768771Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7768872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7769003Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7769110Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7769592Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7770108Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7770583Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7771056Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7771534Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7772000Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7772469Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7772952Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7773425Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7773944Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7774276Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.7774448Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.7774582Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7774714Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7774962Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7775905Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7775993Z graph_break [] 2025-12-04T09:41:43.7776099Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7776272Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7776371Z Autotune Choices Stats: 2025-12-04T09:41:43.7777253Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.7777358Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7777445Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7777549Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7778043Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7778560Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7779091Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7779572Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7780042Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7780524Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7780997Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7781478Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7781949Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7782422Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7782754Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.7782886Z Autotune Choices Stats: 2025-12-04T09:41:43.7783727Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", 
"best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7783823Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7783947Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7784055Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7784532Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7785011Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7785483Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7785956Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7786431Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7786897Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7787416Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7787931Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7788444Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7788924Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7789255Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.7789346Z Autotune Choices Stats: 2025-12-04T09:41:43.7790180Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7790276Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7790363Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7790472Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7790947Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7791422Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7791898Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7792415Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7792897Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7793416Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7793896Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7794383Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7794859Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7795334Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7795662Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.7795834Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7795930Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7796060Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7796309Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7797728Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), 
('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7797849Z graph_break [] 2025-12-04T09:41:43.7797955Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7798128Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7798221Z Autotune Choices Stats: 2025-12-04T09:41:43.7799079Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.7799173Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7799263Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7799364Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7799896Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7800539Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7801020Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7801496Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7802056Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.7802533Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7803055Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7803527Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7804002Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7804482Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7804821Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.7804912Z Autotune Choices Stats: 2025-12-04T09:41:43.7805772Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7805864Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7805949Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7806054Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7806591Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7807093Z 
triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7807650Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7808115Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7808583Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7809055Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7809526Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7810002Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7810477Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7810950Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7811324Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 
2025-12-04T09:41:43.7811501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7811595Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7811727Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7811973Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7813394Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7813482Z graph_break [] 2025-12-04T09:41:43.7813585Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7813763Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7813853Z Autotune Choices Stats: 2025-12-04T09:41:43.7814695Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7814790Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7814875Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7814977Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7815459Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7816004Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7816478Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7817017Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7817532Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7818000Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7818473Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7818942Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7819413Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7819882Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7820212Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.7820307Z Autotune Choices Stats: 2025-12-04T09:41:43.7821189Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7821283Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7821370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7821472Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7821989Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7822465Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7822934Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7823405Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7823877Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7824348Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7824813Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7825327Z triton_mm_1991 0.0346 ms 79.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7825806Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7826323Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7826654Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.7826825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7826920Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7827049Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7827292Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7828675Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7828759Z graph_break [] 2025-12-04T09:41:43.7828864Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7829035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7829124Z Autotune Choices Stats: 2025-12-04T09:41:43.7829963Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7830096Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7830182Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7830288Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7830767Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7831287Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7831764Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7832249Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7832734Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7833206Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7833677Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7834152Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7834673Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7835144Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7835513Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.7835605Z Autotune Choices Stats: 2025-12-04T09:41:43.7836447Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7836541Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7836626Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7836732Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7837265Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7837743Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7838215Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7838683Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.7839154Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7839708Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7840179Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7840695Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7841168Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7841645Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7841978Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.7842156Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7842248Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7842376Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7842629Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7844003Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), 
('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7844129Z graph_break [] 2025-12-04T09:41:43.7844232Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7844441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7844533Z Autotune Choices Stats: 2025-12-04T09:41:43.7845368Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7845464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7845548Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7845651Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7846131Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7846606Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7847108Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7847606Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7848073Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7848548Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7849064Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7849545Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7850080Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7850563Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7850895Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.7850991Z Autotune Choices Stats: 2025-12-04T09:41:43.7851837Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7851931Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7852021Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7852124Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7852603Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7853078Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7853597Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7854106Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7854576Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7855043Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7855512Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7855991Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7856470Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7856943Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7857273Z 
SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.7857444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7857536Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7857710Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7857960Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7858904Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7859029Z graph_break [] 2025-12-04T09:41:43.7859132Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7859306Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7859396Z Autotune Choices Stats: 2025-12-04T09:41:43.7860224Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7860322Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7860407Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7860517Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7860993Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7861471Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.7861947Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7862419Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7862945Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7863475Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7863953Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7864419Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7864892Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7865373Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7865708Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.7865805Z Autotune Choices Stats: 2025-12-04T09:41:43.7866639Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7866731Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7866822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7866966Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7867467Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7867967Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7868474Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7868951Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7869420Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7869897Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7870366Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7870846Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7871322Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7871793Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7872167Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.7872296Z Autotune Choices Stats: 2025-12-04T09:41:43.7873148Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7873240Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7873325Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7873439Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7873915Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7874396Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7874871Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7875347Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7875913Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7876475Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7877004Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7877506Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7878072Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7878557Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7878888Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.7879069Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7879161Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7879298Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7879598Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7880976Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7881063Z graph_break [] 2025-12-04T09:41:43.7881166Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7881344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7881441Z Autotune Choices Stats: 2025-12-04T09:41:43.7882327Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7882463Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7882557Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7882664Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7883143Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7883618Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7884098Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7884576Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.7885059Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7885533Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7886016Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7886553Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7887030Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7887601Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7887937Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.7888033Z Autotune Choices Stats: 2025-12-04T09:41:43.7888881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7888977Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7889071Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7889172Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7889659Z 
triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7890133Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7890608Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7891175Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7891648Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7892161Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7892628Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7893107Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7893585Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7894062Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7894399Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.7894570Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7894666Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7894797Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7895040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7895990Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7896113Z graph_break [] 2025-12-04T09:41:43.7896223Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7896396Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7896487Z Autotune Choices Stats: 2025-12-04T09:41:43.7897376Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7897491Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7897588Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7897716Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7898197Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7898683Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7899158Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7899635Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7900115Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7900809Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7901352Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7901827Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7902312Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7902781Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7903122Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.7903214Z Autotune Choices Stats: 
2025-12-04T09:41:43.7904059Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7904158Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7904244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7904350Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7904828Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7905356Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7905829Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7906359Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7906839Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7907362Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7907835Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.7908318Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7908797Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7909277Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7909612Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.7909713Z Autotune Choices Stats: 2025-12-04T09:41:43.7910586Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.7910715Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7910803Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7910916Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7911399Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7911869Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7912349Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7912824Z 
triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7913298Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7913771Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7914239Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7914764Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7915237Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7915749Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7916081Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.7916255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7916354Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7916485Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7916733Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7918116Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7918203Z graph_break [] 2025-12-04T09:41:43.7918311Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7918486Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7918578Z Autotune Choices Stats: 2025-12-04T09:41:43.7919458Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7919601Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7919760Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7919865Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7920350Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7920838Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7921315Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7921797Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7922267Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7922737Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7923210Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7923679Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7924205Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7924676Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7925053Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.7925147Z Autotune Choices Stats: 2025-12-04T09:41:43.7925982Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7926080Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7926172Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7926279Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7926763Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7927286Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7927770Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7928239Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7928754Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7929226Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7929735Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7930215Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7930693Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7931179Z triton_mm_2337 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7931511Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.7931688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7931782Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7931917Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7932166Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7933546Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7933675Z graph_break [] 2025-12-04T09:41:43.7933780Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7933957Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7934053Z Autotune Choices Stats: 2025-12-04T09:41:43.7934927Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7935028Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7935114Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7935220Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7935697Z triton_mm_2343 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7936179Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7936677Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7937196Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7937675Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7938190Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7938674Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7939191Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7939667Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7940153Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7940487Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.7940584Z Autotune Choices Stats: 2025-12-04T09:41:43.7941434Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7941532Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7941623Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7941725Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7942201Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7942677Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7943190Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7943665Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7944172Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7944641Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.7945115Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7945591Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7946077Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7946549Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7946888Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.7947063Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7947165Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7947305Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7947591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7949011Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7949097Z graph_break [] 2025-12-04T09:41:43.7949200Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7949379Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7949470Z Autotune Choices Stats: 2025-12-04T09:41:43.7950315Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.7950411Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7950497Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7950607Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7951087Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7951559Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7952032Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7952560Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7953039Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7953549Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7954037Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.7954519Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7955002Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7955487Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7955821Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.7955919Z Autotune Choices Stats: 2025-12-04T09:41:43.7956769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7956873Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7957037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7957156Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7957704Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7958182Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7958662Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7959131Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7959661Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7960141Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7960612Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7961092Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7961569Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7962098Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7962433Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.7962649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7962748Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7962880Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7963125Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7964072Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7964160Z graph_break [] 2025-12-04T09:41:43.7964270Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7964449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.7964543Z Autotune Choices Stats: 2025-12-04T09:41:43.7965382Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7965475Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7965570Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7965674Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7966152Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7966673Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7967224Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7967718Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7968194Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7968675Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7969155Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7969630Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7970113Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7970582Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7970958Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:43.7971053Z Autotune Choices Stats: 2025-12-04T09:41:43.7971886Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.7972024Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7972114Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7972224Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.7972700Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7973172Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7973650Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7974127Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7974603Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7975081Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7975557Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7976071Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7976585Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7977069Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7977399Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.7977496Z Autotune Choices Stats: 2025-12-04T09:41:43.7978322Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.7978420Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7978512Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7978623Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.7979105Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7979573Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7980043Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7980584Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7981058Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7981575Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7982050Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7982530Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7983008Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7983488Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7983823Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.7984000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.7984100Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.7984230Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.7984477Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.7985466Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.7985586Z graph_break [] 2025-12-04T09:41:43.7985696Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.7985868Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.7985962Z Autotune Choices Stats: 2025-12-04T09:41:43.7986812Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.7986906Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7986998Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7987105Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.7987586Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7988087Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7988582Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7989064Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7989546Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7990062Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7990542Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7991074Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7991548Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7992021Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7992366Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.7992462Z Autotune Choices Stats: 2025-12-04T09:41:43.7993288Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.7993386Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.7993473Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.7993584Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.7994059Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.7994586Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7995105Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.7995585Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.7996066Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.7996535Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7997010Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.7997492Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7998020Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.7998503Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.7998831Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.7998968Z Autotune Choices Stats: 2025-12-04T09:41:43.7999852Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 
0} 2025-12-04T09:41:43.7999947Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8000041Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8000191Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8000822Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8001306Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8001785Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8002271Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8002765Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8003243Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8003715Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8004275Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8004804Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8005273Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8005608Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.8005782Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8005881Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8006013Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8010031Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8011451Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8011541Z graph_break [] 2025-12-04T09:41:43.8011649Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8011828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8011921Z Autotune Choices Stats: 2025-12-04T09:41:43.8012767Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.8012956Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8013050Z strides: [256, 1], 
[256, 1] 2025-12-04T09:41:43.8013158Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8013695Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8014173Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8014653Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8015138Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8015615Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8016098Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8016571Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8017061Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8017602Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8018115Z 
triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8018453Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.8018545Z Autotune Choices Stats: 2025-12-04T09:41:43.8019378Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8019480Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8019566Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8019680Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8020158Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8020634Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8021114Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8021583Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8022106Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8022576Z triton_mm_2634 0.0307 ms 90.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8023085Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8023567Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8024042Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8024525Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8024855Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.8025035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8025133Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8025270Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8025523Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8026464Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8026555Z graph_break [] 2025-12-04T09:41:43.8026699Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8026877Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.8027019Z Autotune Choices Stats: 2025-12-04T09:41:43.8027908Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8028004Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8028096Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8028201Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8028680Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8029170Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8029655Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8030131Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8030601Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8031073Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8031612Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8032088Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8032595Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8033078Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8033419Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.8033516Z Autotune Choices Stats: 2025-12-04T09:41:43.8034353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.8034451Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8034539Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8034651Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8035127Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8035600Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8036109Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8036589Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8037099Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8037566Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8038034Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8038516Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8038997Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8039541Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8039885Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.8039982Z Autotune Choices Stats: 2025-12-04T09:41:43.8040828Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", 
"best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8040970Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8041059Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8041173Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8041649Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8042209Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8042685Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8043155Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8043634Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8044117Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8044589Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8045068Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8045576Z triton_mm_2698 0.0287 ms 92.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8046046Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8046418Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.8046602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8046697Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8046828Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8047103Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8048508Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8048602Z graph_break [] 2025-12-04T09:41:43.8048706Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8048883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8048979Z Autotune Choices Stats: 2025-12-04T09:41:43.8049820Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8049917Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.8050051Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8050163Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8050654Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8051165Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8051635Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8052119Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8052598Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8053087Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8053566Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8054042Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8054510Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8055021Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8055359Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.8055494Z Autotune Choices Stats: 2025-12-04T09:41:43.8056353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8056447Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8056533Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8056642Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8057120Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8057631Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8058132Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8058604Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8059082Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.8059559Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8060080Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8060603Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8061093Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8061565Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8061902Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.8062081Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8062177Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8062318Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8062564Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8063505Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8063593Z graph_break [] 2025-12-04T09:41:43.8063697Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8063875Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8063970Z Autotune Choices Stats: 2025-12-04T09:41:43.8064855Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8065013Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8065101Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8065211Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8065700Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8066179Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8066656Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8067122Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8067595Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8068068Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8068535Z 
triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8069052Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8069527Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8070044Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8070379Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:43.8070475Z Autotune Choices Stats: 2025-12-04T09:41:43.8071305Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8071403Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8071498Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8071607Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8072086Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8072563Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8073036Z triton_mm_2802 0.0307 ms 93.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8073545Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8074028Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8074536Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8075001Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8075477Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8075951Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8076426Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8076760Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:43.8076855Z Autotune Choices Stats: 2025-12-04T09:41:43.8077748Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.8077843Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8077931Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8078089Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8078564Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8079045Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8079598Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8080077Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8080557Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8081051Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8081528Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8082011Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.8082484Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8082990Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8083329Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:43.8083547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8083642Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8083777Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8084030Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8085416Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8085508Z graph_break [] 2025-12-04T09:41:43.8085614Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8085790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8085886Z Autotune Choices Stats: 2025-12-04T09:41:43.8086730Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8086828Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8086916Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8087023Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8087515Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8088074Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8088555Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8089073Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8089553Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8090028Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8090513Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8090986Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8091457Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8091927Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8092263Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.8092362Z Autotune Choices Stats: 2025-12-04T09:41:43.8093241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8093374Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8093466Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8093571Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8094062Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8094535Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8095008Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8095482Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8095963Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8096435Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8096903Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8097430Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8097959Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8098471Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8098807Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:43.8098980Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8099082Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8099220Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8099469Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8100601Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.8100694Z graph_break [] 2025-12-04T09:41:43.8100802Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8100977Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8101070Z Autotune Choices Stats: 2025-12-04T09:41:43.8101912Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8102103Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8102192Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8102355Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8102842Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8103320Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8103797Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8104276Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8104755Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8105240Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8105714Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8106188Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8106669Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8107204Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8107542Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:43.8107661Z Autotune Choices Stats: 2025-12-04T09:41:43.8108581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8108679Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8108766Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8108875Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8109356Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8109834Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:43.8110315Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8110785Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8111258Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8111778Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8112291Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8112780Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8113254Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8113730Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8114064Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:43.8114161Z Autotune Choices Stats: 2025-12-04T09:41:43.8115004Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.8115098Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8115188Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8115300Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8115778Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8116304Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8116773Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8117293Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8117822Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8118301Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8118784Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8119260Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8119801Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8120267Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8120601Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:43.8120779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8120919Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8121057Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8121343Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8122287Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8122369Z graph_break [] 2025-12-04T09:41:43.8122472Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8122647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8122738Z Autotune Choices Stats: 2025-12-04T09:41:43.8123581Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8123679Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8123766Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8123875Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8124355Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8124833Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8125302Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8125815Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8126289Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8126793Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8127321Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8127787Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8128268Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8128742Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8129072Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:43.8129166Z Autotune Choices Stats: 2025-12-04T09:41:43.8129995Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8130100Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8130228Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8130333Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8130809Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8131332Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8131806Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8132271Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8132742Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.8133210Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8133677Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8134154Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8134625Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8135143Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8135476Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:43.8135569Z Autotune Choices Stats: 2025-12-04T09:41:43.8136475Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:43.8136570Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8136661Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8136772Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8137257Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8137780Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8138251Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8138719Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8139185Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8139696Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8140169Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8140684Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8141157Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8141626Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8141963Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.8142136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8142232Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8142365Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8142610Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8143991Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8144119Z graph_break [] 2025-12-04T09:41:43.8144223Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8144404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8144496Z Autotune Choices Stats: 2025-12-04T09:41:43.8145370Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8145466Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8145554Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8145662Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8146137Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8146629Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8147107Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8147591Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8148110Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8148583Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8149094Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8149599Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8150079Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8150552Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8150884Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:43.8150987Z Autotune Choices Stats: 2025-12-04T09:41:43.8151831Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:43.8151936Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8152028Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8152138Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8152625Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8153099Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8153587Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8154106Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8154637Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8155106Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8155577Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8156071Z 
triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8156548Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8157045Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8157380Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:43.8157555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8157657Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8157792Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8158051Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8159033Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8159154Z graph_break [] 2025-12-04T09:41:43.8159268Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8159448Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8159594Z Autotune Choices Stats: 2025-12-04T09:41:43.8160419Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8160519Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8160616Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8160724Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8161201Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8161705Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8162181Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8162664Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8163194Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8163683Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8164188Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8164661Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8165144Z triton_mm_3119 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8165617Z triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8165960Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:43.8166054Z Autotune Choices Stats: 2025-12-04T09:41:43.8166893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8166991Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8167099Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8167222Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8167756Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8168238Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8168754Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8169229Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8169705Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8170179Z triton_mm_3145 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8170653Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8171132Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8171613Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8172088Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8172488Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:43.8172592Z Autotune Choices Stats: 2025-12-04T09:41:43.8173425Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.8173572Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8173664Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8173775Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8174258Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8174730Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8175209Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8175684Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8176159Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8176634Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8177141Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8177635Z triton_mm_3178 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8178187Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8178664Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8178996Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:43.8179171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8179277Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8179414Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8179668Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8181054Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8181140Z graph_break [] 2025-12-04T09:41:43.8181251Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8181426Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8181525Z Autotune Choices Stats: 2025-12-04T09:41:43.8182403Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8182503Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8182598Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8182706Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8183228Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8183718Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8184200Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8184686Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8185158Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8185640Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8186112Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8186635Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8187105Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8187642Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8188015Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.8188111Z Autotune Choices Stats: 2025-12-04T09:41:43.8188947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8189049Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8189137Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8189249Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8189730Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8190214Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8190689Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8191164Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8191738Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8192211Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8192723Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8193198Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8193681Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8194157Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8194494Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:43.8194723Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.8194829Z Traceback (most recent call last): 2025-12-04T09:41:43.8195251Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.8195436Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.8195782Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.8195978Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.8196185Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.8196275Z Searched string: 2025-12-04T09:41:43.8196454Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.8196460Z 2025-12-04T09:41:43.8196579Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.8196584Z 2025-12-04T09:41:43.8196725Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.8196853Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 
2025-12-04T09:41:43.8196858Z 2025-12-04T09:41:43.8196961Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.8197076Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.8197182Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.8197286Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.8197295Z 2025-12-04T09:41:43.8197386Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.8197481Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.8197581Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.8197675Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.8197679Z 2025-12-04T09:41:43.8197683Z 2025-12-04T09:41:43.8197849Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.8197853Z 2025-12-04T09:41:43.8197857Z 2025-12-04T09:41:43.8197985Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.8198105Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.8198225Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.8198313Z idx_m = rm[:, None] 2025-12-04T09:41:43.8198401Z idx_n = rn[None, :] 2025-12-04T09:41:43.8198503Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.8198507Z 2025-12-04T09:41:43.8198610Z # inductor generates a suffix 2025-12-04T09:41:43.8198702Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.8198924Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.8199057Z ''', device_str='cuda') 2025-12-04T09:41:43.8199061Z 2025-12-04T09:41:43.8199068Z 2025-12-04T09:41:43.8199173Z async_compile.wait(globals()) 2025-12-04T09:41:43.8199259Z del async_compile 2025-12-04T09:41:43.8199266Z 2025-12-04T09:41:43.8199351Z class Runner: 2025-12-04T09:41:43.8199462Z def __init__(self, partitions): 2025-12-04T09:41:43.8199620Z self.partitions = partitions 2025-12-04T09:41:43.8199624Z 2025-12-04T09:41:43.8199777Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.8199876Z new_callables = [] 2025-12-04T09:41:43.8199993Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.8200108Z 
new_callables.append(fn(c)) 2025-12-04T09:41:43.8200215Z self.partitions = new_callables 2025-12-04T09:41:43.8200220Z 2025-12-04T09:41:43.8200462Z def call(self, args): 2025-12-04T09:41:43.8200559Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.8200645Z args.clear() 2025-12-04T09:41:43.8200778Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.8200912Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.8201021Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.8201123Z torch.cuda.set_device(0) 2025-12-04T09:41:43.8201297Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.8201522Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:43.8201627Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.8201822Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.8201907Z del arg0_1 2025-12-04T09:41:43.8202077Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.8202339Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.8202440Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.8202739Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.8202823Z del arg1_1 2025-12-04T09:41:43.8202906Z del buf0 2025-12-04T09:41:43.8203071Z return (buf1, ) 2025-12-04T09:41:43.8203076Z 2025-12-04T09:41:43.8203178Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.8203270Z call = runner.call 2025-12-04T09:41:43.8203430Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.8203437Z 2025-12-04T09:41:43.8203441Z 2025-12-04T09:41:43.8203581Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.8203719Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.8203868Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.8204076Z arg0_1 = rand_strided((256, 256), 
(256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.8204277Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.8204382Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.8204555Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.8204560Z 2025-12-04T09:41:43.8204566Z 2025-12-04T09:41:43.8204657Z if __name__ == "__main__": 2025-12-04T09:41:43.8204861Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.8205029Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.8205115Z From CHECK: .to( 2025-12-04T09:41:43.8205119Z 2025-12-04T09:41:43.8205123Z 2025-12-04T09:41:43.8205305Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.8205860Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.8205865Z 2025-12-04T09:41:43.8206081Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.8206330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8206429Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8206570Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8206821Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8208300Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.8208398Z graph_break [] 2025-12-04T09:41:43.8208504Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8208688Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8208783Z Autotune Choices Stats: 2025-12-04T09:41:43.8209626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8209729Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8209821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8209934Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8210420Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8210882Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8211404Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8211904Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8212372Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8212845Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8213310Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8213776Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8214237Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8214711Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8215049Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.8215150Z Autotune Choices Stats: 2025-12-04T09:41:43.8215986Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8216124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8216222Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8216328Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8216850Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8217318Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.8217779Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8218246Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8218703Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8219171Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8219629Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8220100Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8220605Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8221075Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8221452Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.8221629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8221737Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8221870Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8222117Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8223063Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8223152Z graph_break [] 2025-12-04T09:41:43.8223265Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8223439Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8223534Z Autotune Choices Stats: 2025-12-04T09:41:43.8224367Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8224464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8224552Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8224665Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8225182Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8225663Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8226563Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8227051Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8227533Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8228069Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8228546Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8229018Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8229496Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8229958Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8230338Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.8230433Z Autotune Choices Stats: 2025-12-04T09:41:43.8231348Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8231452Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8231542Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8231648Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8232123Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8232593Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8233070Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8233539Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8234006Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8234466Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8234971Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8235447Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8235953Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8236431Z 
triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8236760Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.8236858Z Autotune Choices Stats: 2025-12-04T09:41:43.8237739Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.8237852Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8237949Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8238062Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8238553Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8239017Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8239574Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8240118Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8240628Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8241115Z triton_mm_861 0.0276 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8241583Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8242055Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8242530Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8242995Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8243331Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.8243507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8243606Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8243739Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8243989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8244981Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8245069Z graph_break [] 2025-12-04T09:41:43.8245176Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8245394Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.8245490Z Autotune Choices Stats: 2025-12-04T09:41:43.8246327Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.8246427Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8246520Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8246636Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8247114Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8247589Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8248052Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8248513Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8249031Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8249504Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8250018Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8250489Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8250954Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8251424Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8251761Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.8251866Z Autotune Choices Stats: 2025-12-04T09:41:43.8252698Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.8252797Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8252888Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8252993Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8253469Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8254010Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8254485Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8254987Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8255458Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8255921Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8256397Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8256874Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8257345Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8257875Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8258203Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.8258346Z Autotune Choices Stats: 2025-12-04T09:41:43.8259183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8259319Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8259416Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8259531Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8260017Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8260494Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8260979Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8261456Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8261919Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8262387Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8262852Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8263414Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8263880Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8264388Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8264766Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.8264942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8265044Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8265181Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8265432Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8266819Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8266907Z graph_break [] 2025-12-04T09:41:43.8267018Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8267195Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8267289Z Autotune Choices Stats: 2025-12-04T09:41:43.8268225Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8268361Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.8268455Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8268565Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8269053Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8269523Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8269984Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8270456Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8275407Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8275919Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8276399Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8276871Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8277414Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.8277877Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8278285Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.8278379Z Autotune Choices Stats: 2025-12-04T09:41:43.8279228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8279331Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8279417Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8279587Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8280068Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8280543Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8281065Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8281526Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8282056Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8282523Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8283045Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8283519Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8283996Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8284472Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8284872Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.8285050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8285149Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8285288Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8285537Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8286534Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8286668Z graph_break [] 2025-12-04T09:41:43.8286777Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8286961Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8287056Z Autotune Choices Stats: 2025-12-04T09:41:43.8287981Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.8288083Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8288171Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8288284Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8288770Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8289248Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8289721Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8290190Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8290696Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8291167Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8291682Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8292239Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8292776Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8293450Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8293976Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.8294117Z Autotune Choices Stats: 2025-12-04T09:41:43.8295261Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8295403Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8295528Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8295682Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8296363Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8297026Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8297883Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8298525Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8299219Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8299747Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8300219Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8301078Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8301561Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8302056Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8302393Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.8302489Z Autotune Choices Stats: 2025-12-04T09:41:43.8303464Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8303578Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8303731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8303847Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8304330Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8304823Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8305294Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8305776Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8306251Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8306736Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8307213Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8307822Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8308552Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8309225Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8309803Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.8310067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8310205Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8310398Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8310752Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8311713Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8311811Z graph_break [] 2025-12-04T09:41:43.8311917Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8312100Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8312194Z Autotune Choices Stats: 2025-12-04T09:41:43.8313037Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8313137Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8313225Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8313339Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:43.8313880Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8314364Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8314893Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8315375Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8315863Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8316354Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8316842Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8317516Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8318173Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8318852Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8319438Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.8319706Z Autotune Choices Stats: 2025-12-04T09:41:43.8321046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8321192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8321332Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8321486Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8322155Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8322730Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8323207Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8323701Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8324174Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8324645Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8325183Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8325700Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8326176Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8326650Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8326992Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.8327098Z Autotune Choices Stats: 2025-12-04T09:41:43.8327991Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8328088Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8328175Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8328292Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8328771Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8329247Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8329719Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8330241Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8330765Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8331459Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8332284Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8332995Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8333663Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8334193Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8334537Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.8334729Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:43.8334826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8334967Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8335218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8336667Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8336796Z graph_break [] 2025-12-04T09:41:43.8336902Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8337083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8337177Z Autotune Choices Stats: 2025-12-04T09:41:43.8338013Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8338129Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8338217Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8338330Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8338814Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8339292Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.8339771Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8340254Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8340783Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8341307Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8341792Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8342277Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8342763Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8343240Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8343581Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.8343679Z Autotune Choices Stats: 2025-12-04T09:41:43.8344519Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8344614Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8344711Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8344818Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8345344Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8345914Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8346387Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8346859Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8347330Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8347856Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8348337Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8348816Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8349299Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8349816Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8350155Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.8350332Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8350433Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8350639Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8350890Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8352324Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8352415Z graph_break [] 2025-12-04T09:41:43.8352529Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8352708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8352800Z Autotune Choices Stats: 2025-12-04T09:41:43.8353663Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8353757Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8353850Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8353957Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8354446Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8354975Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8355513Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8355990Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8356458Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8356929Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8357464Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8357941Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.8358420Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8358902Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8359283Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.8359380Z Autotune Choices Stats: 2025-12-04T09:41:43.8360303Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8360452Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8360541Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8360660Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8361146Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8361623Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8362114Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8362588Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8363063Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8363529Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8364002Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8364529Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8365041Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8365522Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8365853Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.8366037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8366138Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8366273Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8366526Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8367920Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8368009Z graph_break [] 2025-12-04T09:41:43.8368115Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8368290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8368396Z Autotune Choices Stats: 2025-12-04T09:41:43.8369276Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.8369379Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8369467Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8369573Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8370099Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8370586Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8371071Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8371561Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8372035Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8372526Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8373002Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8373514Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8373989Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8374503Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8374836Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.8374928Z Autotune Choices Stats: 2025-12-04T09:41:43.8375776Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8375873Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8375968Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8376075Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8376562Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8377045Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8377519Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8378050Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8378562Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8379033Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8379548Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8380025Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8380503Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8380981Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8381318Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.8381506Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8381602Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8381740Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8381985Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8383417Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8383541Z graph_break [] 2025-12-04T09:41:43.8383646Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8383827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8383920Z Autotune Choices Stats: 2025-12-04T09:41:43.8384766Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8384860Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8384947Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8385063Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8385544Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8386220Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8386939Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8387706Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8388555Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8389379Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8390154Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8391089Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8391862Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8392580Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8393069Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.8393212Z 
Autotune Choices Stats: 2025-12-04T09:41:43.8394379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.8394538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8394671Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8394838Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8395602Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8396519Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8397328Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8398171Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8398967Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8399840Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8400819Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.8401629Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8402432Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8403244Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8403815Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.8404262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8404435Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8404655Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8405071Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8407556Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8407709Z graph_break [] 2025-12-04T09:41:43.8407884Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8408176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8408340Z Autotune Choices Stats: 2025-12-04T09:41:43.8409743Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8409909Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8410061Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8410227Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8411026Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8411814Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8412717Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8413575Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8414363Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8415138Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8418869Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8419706Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8420521Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8421309Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8421863Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.8422015Z Autotune Choices Stats: 2025-12-04T09:41:43.8423438Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8423707Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8423853Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8424019Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8424896Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8425689Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8426478Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8427274Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8428038Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8428841Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8429612Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8430394Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8431204Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8432093Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8432649Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.8432953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8433111Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8433328Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8433736Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8435440Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8435619Z graph_break [] 2025-12-04T09:41:43.8435795Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8436091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8436247Z Autotune Choices Stats: 2025-12-04T09:41:43.8437652Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.8437813Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8437965Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8438133Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8439029Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8439890Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8440759Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8441537Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8442326Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8443120Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8443924Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8444740Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8445534Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8446323Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8446884Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.8447128Z Autotune Choices Stats: 2025-12-04T09:41:43.8448571Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8448732Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8448873Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8449052Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8449840Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.8450748Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8451541Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8452308Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8453118Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8453903Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8454831Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8455605Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8456489Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8457287Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8457839Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.8458005Z Autotune Choices Stats: 2025-12-04T09:41:43.8459390Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8459562Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8459699Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8459871Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8460698Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8461496Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8462279Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8463097Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8464025Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8464817Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8465606Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8466514Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8467319Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8468209Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8468770Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.8469070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8469230Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8469440Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8469959Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8471527Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8471679Z graph_break [] 2025-12-04T09:41:43.8471950Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8472248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8472399Z Autotune Choices Stats: 2025-12-04T09:41:43.8473759Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8473938Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8474090Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8474262Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8475056Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8475869Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8476684Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8477479Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8478322Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8479225Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8480090Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8480873Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8481744Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8482528Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8483096Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.8483246Z Autotune Choices Stats: 2025-12-04T09:41:43.8484620Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8484778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8484922Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8485099Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8485969Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8486768Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8487618Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8488391Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:43.8489181Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8489974Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8490737Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8491543Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8492333Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8493114Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8493669Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.8493827Z Autotune Choices Stats: 2025-12-04T09:41:43.8495198Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8495452Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8495593Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8495774Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:43.8496583Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8497508Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8498311Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8499140Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8499922Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8500922Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8501884Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8502672Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8503582Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8504375Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8504942Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.8505237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8505413Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8505629Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8506035Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8507670Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8507816Z graph_break [] 2025-12-04T09:41:43.8507999Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8508278Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8508449Z Autotune Choices Stats: 2025-12-04T09:41:43.8509825Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8509986Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8510265Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8510436Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8511232Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8512047Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8512832Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8513797Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8514620Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8515421Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8516204Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8516976Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8517850Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8518647Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8519289Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:43.8519445Z Autotune Choices Stats: 2025-12-04T09:41:43.8520889Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.8521055Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8521199Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8521384Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8522172Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8522991Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8523769Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8524530Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8525331Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8526112Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8527002Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8527818Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8528615Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8529504Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8530071Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.8530237Z Autotune Choices Stats: 2025-12-04T09:41:43.8531619Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8531788Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8531933Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8532113Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8532936Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8533826Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8534624Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.8535492Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8536276Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8537087Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8537902Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8538706Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8539460Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8540261Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8540841Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.8541143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8541305Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8541613Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8542017Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.8544357Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8544509Z graph_break [] 2025-12-04T09:41:43.8544691Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8545078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8545242Z Autotune Choices Stats: 2025-12-04T09:41:43.8546626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8546801Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8546953Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8547123Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8547924Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8548716Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8549620Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8550410Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8551264Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8551963Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8552746Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8553496Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8554255Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8555037Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8555610Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.8555757Z Autotune Choices Stats: 2025-12-04T09:41:43.8557149Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8557418Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8557579Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8557786Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8558603Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8559396Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8560398Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8561199Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8561990Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8562766Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8563563Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8564368Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8565271Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8566065Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8566776Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.8567102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8567282Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8567508Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8567913Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8570237Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8570396Z graph_break [] 2025-12-04T09:41:43.8570566Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8570872Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8571022Z Autotune Choices Stats: 2025-12-04T09:41:43.8572428Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.8572598Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8572736Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8572924Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.8573865Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8574672Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8575473Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8576351Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8577181Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8577984Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8578805Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8579591Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8580389Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8581295Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8581858Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.8582025Z Autotune Choices Stats: 2025-12-04T09:41:43.8583502Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8583664Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8583813Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8583987Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8584792Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8585592Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8586388Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8587187Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8588022Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8588822Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8589713Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8590520Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8591305Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8592184Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8592763Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.8593072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8593241Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8593451Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8593859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8596215Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8596450Z graph_break [] 2025-12-04T09:41:43.8596631Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.8596920Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8597078Z Autotune Choices Stats: 2025-12-04T09:41:43.8598617Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8598786Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8598932Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8599106Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8600025Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8601097Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8601908Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8602723Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8603536Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8604361Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8605166Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8606091Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8606886Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8607677Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8608229Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.8608515Z Autotune Choices Stats: 2025-12-04T09:41:43.8609919Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8610095Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8610234Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8610413Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8611209Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8612013Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8612936Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8613717Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8614633Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8615438Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8616230Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8617028Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8617868Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8618679Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8619248Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.8619558Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8619716Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8619933Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.8620363Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8622714Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8622960Z graph_break [] 2025-12-04T09:41:43.8623132Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8623423Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8623586Z Autotune Choices Stats: 2025-12-04T09:41:43.8625049Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8625225Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8625370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8625547Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8626391Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8627200Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8628053Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.8628949Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8629777Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8630707Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8631486Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8632288Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8633085Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8633887Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8634461Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.8634616Z Autotune Choices Stats: 2025-12-04T09:41:43.8646697Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:43.8646891Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8647048Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8647212Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8648064Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8648976Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8649773Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8650558Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8651436Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8652211Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8652993Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8653780Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8654586Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8655494Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8656058Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.8656347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8656502Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8656799Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8657208Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8658859Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8659013Z graph_break [] 2025-12-04T09:41:43.8659184Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8659482Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8659633Z Autotune Choices Stats: 2025-12-04T09:41:43.8661034Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.8661200Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8661336Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8661506Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8662315Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8663128Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8664012Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8664801Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8665608Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8666502Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8667298Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8668156Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8668960Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8669765Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8670321Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.8670580Z Autotune Choices Stats: 2025-12-04T09:41:43.8671958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8672134Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8672274Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8672525Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8673323Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8674124Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8674933Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8675718Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8676518Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8677306Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.8678141Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8678956Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8679919Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8680738Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8681293Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.8681440Z Autotune Choices Stats: 2025-12-04T09:41:43.8682920Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8683088Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8683235Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8683420Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8684212Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8685025Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.8685816Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8686615Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8687542Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8688424Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8689237Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8690056Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8690859Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8691629Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8692195Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.8692498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8692651Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.8692863Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8693273Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8695625Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8695864Z graph_break [] 2025-12-04T09:41:43.8696033Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8696339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8696484Z Autotune Choices Stats: 2025-12-04T09:41:43.8697950Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.8698114Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8698349Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8698530Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8699336Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8700150Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8701083Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8701604Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8702247Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8702714Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8703251Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8703721Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8704199Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8704676Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8705020Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.8705120Z Autotune Choices Stats: 2025-12-04T09:41:43.8705958Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8706058Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8706145Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8706252Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8706737Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8707213Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8707797Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8708268Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8708736Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8709257Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8709728Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8710211Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8710684Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8711156Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8711530Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.8711715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8711811Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8711944Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8712193Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8713620Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8713709Z graph_break [] 2025-12-04T09:41:43.8713815Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8713991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8714088Z Autotune Choices Stats: 2025-12-04T09:41:43.8714921Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:43.8715019Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8715108Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8715212Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8715695Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8716175Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8716642Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8717188Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8717693Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8718163Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8718664Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8719133Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8719684Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8720149Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8720480Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.8720570Z Autotune Choices Stats: 2025-12-04T09:41:43.8721413Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8721554Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8721639Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8721750Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8722268Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8722740Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8723206Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8723681Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8724147Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8724618Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8725085Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8725555Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8726031Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8726539Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8726874Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.8727048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8727141Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8727274Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8727518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8728927Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8729023Z graph_break [] 2025-12-04T09:41:43.8729131Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8729312Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8729400Z Autotune Choices Stats: 2025-12-04T09:41:43.8730234Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8730368Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8730454Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8730559Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8731037Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8731556Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8732045Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8732520Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8733005Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8733471Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8733941Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8734411Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8734896Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8735469Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8735847Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.8735944Z Autotune Choices Stats: 2025-12-04T09:41:43.8736779Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8736872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8736960Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8737061Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8737617Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8738148Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8738613Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8739078Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8739543Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8740047Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8740515Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8741025Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8741500Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8741968Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8742299Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.8742474Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.8742575Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8742705Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8742951Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8744326Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8744411Z graph_break [] 2025-12-04T09:41:43.8744521Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8744693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8744827Z Autotune Choices Stats: 2025-12-04T09:41:43.8745742Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8745839Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8745928Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8746032Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8746514Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8747035Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.8747558Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8748031Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8748506Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8748985Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8749495Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8749971Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8750487Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8750965Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8751301Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.8751397Z Autotune Choices Stats: 2025-12-04T09:41:43.8752241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8752343Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8752428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8752536Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8753016Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8753488Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8753969Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8754435Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8754947Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8755413Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8755888Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8756402Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8756877Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8757350Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8757678Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.8757857Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8757950Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8758081Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8758374Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8759320Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8759408Z graph_break [] 2025-12-04T09:41:43.8759626Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8759806Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8759902Z Autotune Choices Stats: 2025-12-04T09:41:43.8760739Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8760839Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8760925Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8761029Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8761510Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8761986Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8762462Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8762934Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8763417Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8763957Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8764432Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8764904Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8765416Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8765890Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8766220Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.8766310Z Autotune Choices Stats: 2025-12-04T09:41:43.8767167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8767278Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8767376Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8767480Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8768000Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8768470Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8768976Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8769458Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8769924Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.8770393Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8770866Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8771337Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8771818Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8772290Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8772625Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.8772718Z Autotune Choices Stats: 2025-12-04T09:41:43.8773617Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8773715Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8773801Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8773916Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8774394Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.8774907Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8775386Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8775864Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8776343Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8776818Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8777387Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8777916Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8778440Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8778919Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8779247Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:43.8779432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8779526Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8779656Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8779907Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8781290Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8781379Z graph_break [] 2025-12-04T09:41:43.8781481Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8781654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8781754Z Autotune Choices Stats: 2025-12-04T09:41:43.8782581Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8782717Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8782803Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8782912Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8783392Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8783869Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8784385Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8784861Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8785338Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8785815Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8786289Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8786804Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8787281Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8787789Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8788126Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.8788218Z Autotune Choices Stats: 
2025-12-04T09:41:43.8789057Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8789153Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8789244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8789348Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8789834Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8790316Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8790788Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8791266Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8791735Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8792245Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8792714Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.8793187Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8793706Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8794180Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8794519Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.8794695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8794789Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8794923Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8795168Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8796117Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8796240Z graph_break [] 2025-12-04T09:41:43.8796351Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8796528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8796624Z Autotune Choices Stats: 2025-12-04T09:41:43.8797492Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8797607Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8797697Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8797828Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8798309Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8798793Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8799270Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8799824Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8800581Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8801073Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8801652Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8802122Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8802601Z triton_mm_2225 
0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8803120Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8803452Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.8803548Z Autotune Choices Stats: 2025-12-04T09:41:43.8804374Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8804470Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8804555Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8804658Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8805138Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8805603Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8806124Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8806672Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8807142Z triton_mm_2248 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8807615Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8808121Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8808594Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8809066Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8809540Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8809863Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.8809958Z Autotune Choices Stats: 2025-12-04T09:41:43.8810782Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.8810915Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8811004Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8811115Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8811586Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8812059Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8812565Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8813034Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8813500Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8813967Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8814429Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8814902Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8815408Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8815919Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.8816253Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.8816427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8816520Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8816658Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8816903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8818275Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8818362Z graph_break [] 2025-12-04T09:41:43.8818474Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8818648Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8818738Z Autotune Choices Stats: 2025-12-04T09:41:43.8819568Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8819666Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8819752Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8819902Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8820376Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.8820860Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8821329Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8821843Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8822314Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8822781Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8823247Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8823709Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8824226Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8824685Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8825018Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.8825148Z Autotune Choices Stats: 2025-12-04T09:41:43.8825979Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8826080Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8826165Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8826274Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8826757Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8827235Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8827718Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8828222Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8828698Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8829163Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8829667Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8830146Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8830613Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8831123Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8831460Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.8831641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8831735Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8831863Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8832112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8833486Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8833612Z graph_break [] 2025-12-04T09:41:43.8833715Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8833887Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8833986Z Autotune Choices Stats: 2025-12-04T09:41:43.8834845Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8834941Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8835031Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8835134Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8835614Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8836095Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8836576Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8837049Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8837519Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8838039Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8838510Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8839016Z 
triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8839551Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8840029Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8840402Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:43.8840498Z Autotune Choices Stats: 2025-12-04T09:41:43.8841353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8841449Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8841534Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8841647Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8842125Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8842606Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8843150Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8843634Z triton_mm_2374 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8844141Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8844610Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8845092Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8845569Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8846055Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8846529Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8846871Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.8847044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8847138Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8847281Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8847523Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8848902Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8849035Z graph_break [] 2025-12-04T09:41:43.8849141Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8849324Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8849416Z Autotune Choices Stats: 2025-12-04T09:41:43.8850295Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.8850402Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8850488Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8850597Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8851078Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8851548Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8852021Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8852539Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8853020Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8853542Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8854041Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8854529Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8855002Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8855485Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8855818Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.8855920Z Autotune Choices Stats: 2025-12-04T09:41:43.8856756Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8856856Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8856952Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8857061Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:43.8857593Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8858119Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8858599Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8859069Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8859574Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8860049Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8860521Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8860997Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8861470Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8861982Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8862318Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.8862491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8862630Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8862765Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8863011Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8863960Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8864045Z graph_break [] 2025-12-04T09:41:43.8864159Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8864335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8864429Z Autotune Choices Stats: 2025-12-04T09:41:43.8865272Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8865365Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8865457Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8865562Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8866044Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8866530Z 
triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8867046Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8867528Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8868051Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8868563Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8869048Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8869516Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8869991Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8870458Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8870799Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 
2025-12-04T09:41:43.8870934Z Autotune Choices Stats: 2025-12-04T09:41:43.8871777Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8871883Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8872010Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8872125Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8872601Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8873072Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8873552Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8874029Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8874508Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8874975Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8875441Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8875923Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8876439Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8876920Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8877251Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.8877351Z Autotune Choices Stats: 2025-12-04T09:41:43.8878242Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8878340Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8878439Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8878556Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8879042Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8879591Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8880057Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.8880591Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8881058Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8881572Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8882045Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8882519Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8882994Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8883466Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8883825Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.8884087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8884191Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8884322Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8884568Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:43.8885544Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8885687Z graph_break [] 2025-12-04T09:41:43.8885802Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8886025Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8886119Z Autotune Choices Stats: 2025-12-04T09:41:43.8886991Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.8887087Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8887199Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8887315Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8887869Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8888348Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8888822Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8889302Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8889785Z 
triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8890307Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8890780Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8891287Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8891800Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8892304Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8892653Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.8892754Z Autotune Choices Stats: 2025-12-04T09:41:43.8893592Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8893694Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8893783Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8893900Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8894403Z triton_mm_2543 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8894921Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8895415Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8895940Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8896418Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8896884Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8897486Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8897966Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8898446Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8898929Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8899286Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.8899430Z Autotune Choices Stats: 2025-12-04T09:41:43.8900575Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8900689Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8900776Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8900890Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8901458Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8901930Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8902404Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8902888Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8903374Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8903853Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.8904324Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8904817Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8905283Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8905816Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8906147Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.8906322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8906423Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8906555Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8906860Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8908249Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8908337Z graph_break [] 2025-12-04T09:41:43.8908448Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8908624Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8908717Z Autotune Choices Stats: 2025-12-04T09:41:43.8909557Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.8909707Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8909803Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8909911Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8910420Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8910891Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8911360Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8911845Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8912316Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8912800Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8913269Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8913772Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8914248Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8914718Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8915128Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.8915221Z Autotune Choices Stats: 2025-12-04T09:41:43.8916064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8916163Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8916253Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8916401Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8916877Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8917345Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8917875Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8918340Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8918815Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8919318Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8919918Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8920393Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8920873Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8921346Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8921673Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.8921855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8921951Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8922084Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8922335Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8923276Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8923367Z graph_break [] 2025-12-04T09:41:43.8923476Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8923650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8923746Z Autotune Choices Stats: 2025-12-04T09:41:43.8924617Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8924717Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8924805Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8924909Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8925390Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8925962Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.8926448Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8926923Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8927440Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8927910Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8928418Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8928888Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8929390Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8929907Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8930239Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.8930331Z Autotune Choices Stats: 2025-12-04T09:41:43.8931171Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.8939675Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8939804Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8939920Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:43.8940439Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8940926Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8941398Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8941892Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8942442Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8942910Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8943385Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8943905Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8944389Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8944865Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8945208Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.8945308Z Autotune Choices Stats: 2025-12-04T09:41:43.8946154Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8946302Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8946393Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8946515Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8947000Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8947564Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8948042Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8948518Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8949002Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8949486Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8949967Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8950444Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8950916Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8951396Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8951767Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.8951955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8952055Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8952192Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8952449Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8953886Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8953987Z graph_break [] 2025-12-04T09:41:43.8954096Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8954273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8954383Z Autotune Choices Stats: 2025-12-04T09:41:43.8955232Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.8955336Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8955424Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8955578Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8956124Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8956600Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8957150Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8957679Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8958198Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8958687Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8959159Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8959752Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8960222Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8960699Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8961034Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.8961182Z Autotune Choices Stats: 2025-12-04T09:41:43.8962051Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8962150Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8962244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8962356Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8962843Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8963366Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8963846Z 
triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8964323Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8964790Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8965264Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8965783Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8966261Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8966776Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8967251Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8967585Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.8967768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8967866Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8968006Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8968256Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8969244Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8969333Z graph_break [] 2025-12-04T09:41:43.8969439Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8969622Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8969720Z Autotune Choices Stats: 2025-12-04T09:41:43.8970781Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8970995Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8971119Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8971273Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8971941Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8972736Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8973330Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8973807Z triton_mm_2776 0.0285 ms 97.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8974282Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8974762Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8975236Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8975714Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8976227Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8976745Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8977104Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:43.8977215Z Autotune Choices Stats: 2025-12-04T09:41:43.8978075Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.8978182Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:43.8978272Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8978380Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.8978867Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8979346Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8979835Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8980308Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8980782Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8981299Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8981771Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8982253Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8982813Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.8983304Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8985379Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:43.8985478Z Autotune Choices Stats: 2025-12-04T09:41:43.8986496Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.8986626Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8986751Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8986900Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.8987641Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8988350Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8989045Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8989737Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8990403Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.8991085Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8991631Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.8992115Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8992597Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.8993072Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.8993415Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:43.8993598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.8993787Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.8993937Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.8994190Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.8995596Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.8995730Z graph_break [] 2025-12-04T09:41:43.8995840Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.8996024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.8996124Z Autotune Choices Stats: 2025-12-04T09:41:43.8996995Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.8997092Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.8997183Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.8997298Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.8997787Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8998343Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.8998825Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.8999358Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.8999938Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9000791Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9001280Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9001752Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9002228Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9002694Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9003031Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.9003139Z Autotune Choices Stats: 2025-12-04T09:41:43.9003985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9004205Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9004294Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9004408Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9004897Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9005368Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9005912Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9006382Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9006856Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9007384Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9007853Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9008400Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9008878Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9009418Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9009755Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:43.9009933Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:41:43.9010038Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9010177Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9010446Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9011386Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9011477Z graph_break [] 2025-12-04T09:41:43.9011591Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9011769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9011865Z Autotune Choices Stats: 2025-12-04T09:41:43.9012719Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9012823Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9012926Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9013039Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9013568Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9014046Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9014517Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:43.9015049Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9015541Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9016042Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9016512Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9016986Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9017519Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9018031Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9018380Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:43.9018479Z Autotune Choices Stats: 2025-12-04T09:41:43.9019357Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.9019462Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9019552Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9019671Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9020154Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9020631Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9021118Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9021588Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9022067Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9022542Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9023059Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9023539Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9024020Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9024540Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9024878Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:43.9024983Z Autotune Choices Stats: 2025-12-04T09:41:43.9025824Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.9025932Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9026026Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9026142Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9026626Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9027103Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9027626Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9028142Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9028615Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9029098Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9029583Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9030071Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9030546Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9031019Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9031348Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:43.9031525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9031632Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9031769Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9032019Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9033017Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9033102Z graph_break [] 
2025-12-04T09:41:43.9033219Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9033395Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9033489Z Autotune Choices Stats: 2025-12-04T09:41:43.9034400Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9034505Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9034602Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9034710Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9035195Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9035691Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9036162Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9036679Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9037150Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9037687Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.9038187Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9038656Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9039129Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9039661Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9040004Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:43.9040097Z Autotune Choices Stats: 2025-12-04T09:41:43.9040943Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9041049Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9041142Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9041256Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9041734Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9042258Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.9042740Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9043215Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9043730Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9044237Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9044919Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9045579Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9046250Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9047115Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9047634Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:43.9047804Z Autotune Choices Stats: 2025-12-04T09:41:43.9049072Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:43.9049216Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9049353Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9049468Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9049962Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9050434Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9050912Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9051384Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9051847Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9052322Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9052795Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9053327Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9053800Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9054272Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9054650Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:43.9054832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9054939Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9055075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9055320Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9056710Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9056794Z graph_break [] 2025-12-04T09:41:43.9056953Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9057129Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9057221Z Autotune Choices Stats: 2025-12-04T09:41:43.9058144Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9058246Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9058342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9058447Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9058920Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9059407Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9059891Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9060379Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9060842Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9061310Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9061781Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9062244Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9062760Z 
triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9063229Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9063562Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:43.9063655Z Autotune Choices Stats: 2025-12-04T09:41:43.9064527Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:43.9064630Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9064718Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9064834Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9065314Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9065789Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9066272Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9066786Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9067293Z triton_mm_3108 0.0317 ms 96.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9067818Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9068290Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9068763Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9069236Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9069716Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9070041Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:43.9070219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9070314Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9070443Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9070694Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9071636Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9071793Z graph_break [] 2025-12-04T09:41:43.9071898Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9072074Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9072169Z Autotune Choices Stats: 2025-12-04T09:41:43.9072998Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9073097Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9073226Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9073331Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9073815Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9074309Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9074785Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9075256Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9075780Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9076257Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9076761Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9077266Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9077756Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9078229Z triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9078559Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:43.9078654Z Autotune Choices Stats: 2025-12-04T09:41:43.9079571Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9079675Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9079766Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9079871Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9080348Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9080845Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9081364Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9081849Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9082315Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9082818Z triton_mm_3145 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9083297Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9083772Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9084250Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9084722Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9085098Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:43.9085192Z Autotune Choices Stats: 2025-12-04T09:41:43.9086019Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.9086122Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9086248Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9086364Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9086843Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9087317Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9087790Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9088257Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9088733Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9089201Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9089665Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9090146Z triton_mm_3178 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9090708Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9091184Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9091512Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:43.9091695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9091791Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9091962Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9092219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9093601Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9093693Z graph_break [] 2025-12-04T09:41:43.9093800Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9093977Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9094076Z Autotune Choices Stats: 2025-12-04T09:41:43.9094913Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9095059Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9095147Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9095253Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9095774Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9096249Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9096738Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9097223Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9097729Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9098225Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9098692Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9099164Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9099632Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9100159Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9100832Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:43.9100928Z Autotune Choices Stats: 2025-12-04T09:41:43.9101784Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9101968Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9102057Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9102168Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9102878Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9103624Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9104448Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9105268Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:43.9106192Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9106982Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9107969Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9108678Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9109356Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9109836Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9110177Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:43.9110355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9110454Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9110594Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9110840Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9112216Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 
13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9112394Z graph_break [] 2025-12-04T09:41:43.9112501Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9112682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9112776Z Autotune Choices Stats: 2025-12-04T09:41:43.9113607Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9113709Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9113797Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9113910Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9114429Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9114913Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9115394Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9115866Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9116348Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9116866Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9117341Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9117843Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9118316Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9118803Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9119137Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:43.9119242Z Autotune Choices Stats: 2025-12-04T09:41:43.9120147Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9120250Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9120339Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9120444Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9120927Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9121399Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9121866Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9122959Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9123422Z triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9123893Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9124402Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9124885Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9125358Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9125835Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.9126165Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:43.9126430Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.9126542Z Traceback (most recent call last): 2025-12-04T09:41:43.9126962Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.9127172Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:43.9127551Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:43.9127772Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:43.9127946Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:43.9128033Z Searched string: 2025-12-04T09:41:43.9128171Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:43.9128178Z 2025-12-04T09:41:43.9128303Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:43.9128308Z 2025-12-04T09:41:43.9128440Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.9128567Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:43.9128579Z 2025-12-04T09:41:43.9128674Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:43.9128769Z idx_n = a_k_idx_vals 2025-12-04T09:41:43.9128867Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.9128962Z a = tl.load(A + (xindex)) 2025-12-04T09:41:43.9128966Z 2025-12-04T09:41:43.9129054Z idx_m = b_k_idx_vals 2025-12-04T09:41:43.9129158Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:43.9129251Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.9129340Z b = tl.load(B + (xindex)) 2025-12-04T09:41:43.9129350Z 2025-12-04T09:41:43.9129354Z 2025-12-04T09:41:43.9129514Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:43.9129519Z 2025-12-04T09:41:43.9129522Z 
2025-12-04T09:41:43.9129642Z # rematerialize rm and rn to save registers 2025-12-04T09:41:43.9129768Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:43.9129881Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:43.9129969Z idx_m = rm[:, None] 2025-12-04T09:41:43.9130061Z idx_n = rn[None, :] 2025-12-04T09:41:43.9130204Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:43.9130208Z 2025-12-04T09:41:43.9130310Z # inductor generates a suffix 2025-12-04T09:41:43.9130402Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:43.9130614Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:43.9130710Z ''', device_str='cuda') 2025-12-04T09:41:43.9130715Z 2025-12-04T09:41:43.9130718Z 2025-12-04T09:41:43.9130820Z async_compile.wait(globals()) 2025-12-04T09:41:43.9130904Z del async_compile 2025-12-04T09:41:43.9130908Z 2025-12-04T09:41:43.9130995Z class Runner: 2025-12-04T09:41:43.9131097Z def __init__(self, partitions): 2025-12-04T09:41:43.9131206Z self.partitions = partitions 2025-12-04T09:41:43.9131213Z 2025-12-04T09:41:43.9131362Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:43.9131454Z new_callables = [] 2025-12-04T09:41:43.9131582Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:43.9131690Z new_callables.append(fn(c)) 2025-12-04T09:41:43.9131798Z self.partitions = new_callables 2025-12-04T09:41:43.9131802Z 2025-12-04T09:41:43.9131903Z def call(self, args): 2025-12-04T09:41:43.9131992Z arg0_1, arg1_1 = args 2025-12-04T09:41:43.9132083Z args.clear() 2025-12-04T09:41:43.9132214Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.9132339Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:43.9132450Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:43.9132548Z torch.cuda.set_device(0) 2025-12-04T09:41:43.9132716Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.9132942Z # Topologically Sorted Source Nodes: [to], Original ATen: 
[aten._to_copy] 2025-12-04T09:41:43.9133085Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.9133275Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:43.9133366Z del arg0_1 2025-12-04T09:41:43.9133534Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:43.9133791Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:43.9133932Z stream0 = get_raw_stream(0) 2025-12-04T09:41:43.9134153Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:43.9134242Z del arg1_1 2025-12-04T09:41:43.9134327Z del buf0 2025-12-04T09:41:43.9134412Z return (buf1, ) 2025-12-04T09:41:43.9134417Z 2025-12-04T09:41:43.9134523Z runner = Runner(partitions=[]) 2025-12-04T09:41:43.9134609Z call = runner.call 2025-12-04T09:41:43.9134771Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:43.9134781Z 2025-12-04T09:41:43.9134784Z 2025-12-04T09:41:43.9134923Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:43.9135053Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:43.9135211Z from torch._inductor.utils import print_performance 2025-12-04T09:41:43.9135417Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:43.9135617Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:43.9135723Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:43.9135886Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:43.9135890Z 2025-12-04T09:41:43.9135894Z 2025-12-04T09:41:43.9135992Z if __name__ == "__main__": 2025-12-04T09:41:43.9136190Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:43.9136350Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:43.9136445Z From CHECK: .to( 2025-12-04T09:41:43.9136449Z 2025-12-04T09:41:43.9136453Z 
2025-12-04T09:41:43.9136627Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:43.9137196Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:43.9137254Z 2025-12-04T09:41:43.9137502Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:43.9137680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9137783Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9137913Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9138174Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9139593Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9139686Z graph_break [] 2025-12-04T09:41:43.9139807Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9139985Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9140087Z Autotune Choices Stats: 2025-12-04T09:41:43.9140926Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9141023Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9141189Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.9141296Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9141781Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9142294Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9142758Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9143230Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9143685Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9144163Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9144630Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9145086Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9145548Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9146017Z triton_mm_22 0.0287 ms 92.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9146366Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.9146504Z Autotune Choices Stats: 2025-12-04T09:41:43.9147341Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9147438Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9147528Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9147641Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9148151Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9148619Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9149078Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9149543Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9150005Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9150469Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9150973Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9151441Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9151949Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9152416Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9152747Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:43.9152937Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9153038Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9153174Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9153424Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9154370Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9154461Z graph_break [] 2025-12-04T09:41:43.9154570Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9154746Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.9154844Z Autotune Choices Stats: 2025-12-04T09:41:43.9155675Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9155816Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9155904Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9156009Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9156496Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9156965Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9157475Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9158003Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9158483Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9158970Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9159431Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9159968Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9160477Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9160983Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9161316Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:43.9161410Z Autotune Choices Stats: 2025-12-04T09:41:43.9162241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9162338Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9162432Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9162538Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9163012Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9163483Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9163952Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.9164432Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9164898Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9165411Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9165877Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9166343Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9166966Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9167440Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9167784Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:43.9167898Z Autotune Choices Stats: 2025-12-04T09:41:43.9168751Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 
2025-12-04T09:41:43.9168855Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9168948Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9169067Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9169585Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9170045Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9170552Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9171021Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9171506Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9171978Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9172452Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9172927Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9173392Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9173866Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9174202Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:43.9174387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9174525Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9174663Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9174923Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9175861Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9175959Z graph_break [] 2025-12-04T09:41:43.9176064Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9176304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9176409Z Autotune Choices Stats: 2025-12-04T09:41:43.9177256Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.9177363Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9177456Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9177562Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9178048Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9178515Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9179031Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9179501Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9180007Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9180486Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9180954Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9181434Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9181902Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9182373Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.9182799Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:43.9182940Z Autotune Choices Stats: 2025-12-04T09:41:43.9184267Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:43.9184509Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9184656Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9184830Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9185658Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9186493Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9187302Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9188220Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9189039Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9189860Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9190676Z 
triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9191488Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9192386Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9193210Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9193842Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:43.9193998Z Autotune Choices Stats: 2025-12-04T09:41:43.9195443Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9195612Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9195760Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9195952Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9196785Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9197625Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9198518Z triton_mm_949 0.0266 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9199311Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9200118Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9201237Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9202118Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9202854Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9203606Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9204503Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9205009Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:43.9205286Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9205425Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9205620Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.9206001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9208158Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9208433Z graph_break [] 2025-12-04T09:41:43.9208595Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9208909Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9209063Z Autotune Choices Stats: 2025-12-04T09:41:43.9210638Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9210801Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9210940Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9211098Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9211859Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9212562Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9213230Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:43.9213742Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9214213Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9214691Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9215162Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9215718Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9216184Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9216653Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9217030Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:43.9217131Z Autotune Choices Stats: 2025-12-04T09:41:43.9218028Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.9218129Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9224248Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9224385Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9224900Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9225386Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9225929Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9226405Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9226910Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9227375Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9227903Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9228382Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9228868Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9229340Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9229676Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:43.9229859Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9229962Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9230103Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9230354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9231371Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9231460Z graph_break [] 2025-12-04T09:41:43.9231568Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9231751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9231844Z Autotune Choices Stats: 2025-12-04T09:41:43.9232722Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.9232828Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9232922Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9233035Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9233524Z triton_mm_1016 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9233999Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9234471Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9234982Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9235458Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9235967Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9236451Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9236927Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9237405Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9237883Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9238219Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:43.9238321Z Autotune Choices Stats: 2025-12-04T09:41:43.9239155Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9239252Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9239348Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9239456Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9240001Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9240518Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9240998Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9241472Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9241979Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9242455Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.9242928Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9243405Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9243878Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9244354Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9244726Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:43.9244822Z Autotune Choices Stats: 2025-12-04T09:41:43.9245702Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9245799Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9245889Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9246009Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9246489Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9246980Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:43.9247466Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9247969Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9248447Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9248919Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9249394Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9249909Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9250388Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9250861Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9251230Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:43.9251413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9251511Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:43.9251648Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9251896Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9252842Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9252931Z graph_break [] 2025-12-04T09:41:43.9253035Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9253213Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9253305Z Autotune Choices Stats: 2025-12-04T09:41:43.9254177Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9254277Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9254364Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9254473Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9255048Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9255527Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9256010Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9256492Z triton_mm_1102 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9256986Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9257516Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9257990Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9258481Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9258948Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9259685Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9260192Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.9260344Z Autotune Choices Stats: 2025-12-04T09:41:43.9261519Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9261624Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:43.9261716Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9261829Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9262306Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9262799Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9263270Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9263749Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9264264Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9264741Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9265270Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9265752Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9266228Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9266701Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9267045Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:43.9267140Z Autotune Choices Stats: 2025-12-04T09:41:43.9267968Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9268064Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9268151Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9268271Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9268745Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9269265Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9269738Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9270215Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9270733Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.9271216Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9271707Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9272175Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9272650Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9273122Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9273492Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:43.9273678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9273776Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9273911Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9274197Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9275587Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9275677Z graph_break [] 2025-12-04T09:41:43.9275784Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9275964Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9276060Z Autotune Choices Stats: 2025-12-04T09:41:43.9276888Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9276989Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9277080Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9277190Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9277669Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9278143Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9278659Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9279139Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9279700Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9280220Z triton_mm_1190 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9280701Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9281189Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9281668Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9282144Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9282473Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:43.9282610Z Autotune Choices Stats: 2025-12-04T09:41:43.9283449Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9283546Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9283678Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9283789Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9284271Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9284752Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9285224Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9285699Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9286168Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9286636Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9287105Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9287589Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9288143Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9288614Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9288945Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:43.9289121Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.9289221Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9289394Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9289644Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9291029Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9291113Z graph_break [] 2025-12-04T09:41:43.9291222Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9291396Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9291487Z Autotune Choices Stats: 2025-12-04T09:41:43.9292331Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9292468Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9292557Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9292664Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9293192Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9293678Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.9294144Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9294621Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9295091Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9295558Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9296035Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9296501Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9296980Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9297544Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9297881Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:43.9297972Z Autotune Choices Stats: 2025-12-04T09:41:43.9298810Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9298948Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9299037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9299143Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9299622Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9300096Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9300894Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9301368Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9301962Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9302433Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9302953Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9303430Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9303904Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9304468Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9304856Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:43.9305058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9305160Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9305299Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9305587Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9307352Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9307445Z graph_break [] 2025-12-04T09:41:43.9307613Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9307809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9307908Z Autotune Choices Stats: 2025-12-04T09:41:43.9308904Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.9309001Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9309089Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9309199Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9309820Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9310295Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9310776Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9311252Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9311724Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9312249Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9312727Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9313243Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.9313711Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9314181Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9314516Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:43.9314609Z Autotune Choices Stats: 2025-12-04T09:41:43.9315454Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9315549Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9315642Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9315749Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9316225Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9316704Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9317211Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9317752Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9318217Z 
triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9318680Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9319201Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9319734Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9320211Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9320683Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9321015Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:43.9321189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9321328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9321464Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9321707Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9323128Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9323214Z graph_break [] 2025-12-04T09:41:43.9323322Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9323497Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9323593Z Autotune Choices Stats: 2025-12-04T09:41:43.9324424Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9324526Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9324611Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9324724Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9325196Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9325669Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9326145Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9326623Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9327175Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9327678Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9328165Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9328680Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9329152Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9329624Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9329952Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:43.9330047Z Autotune Choices Stats: 2025-12-04T09:41:43.9330878Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.9331018Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9331105Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9331213Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9331688Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9332192Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9332660Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9333134Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9333600Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9334074Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9334540Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9335015Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9335486Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9335963Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9336331Z SingleProcess AUTOTUNE 
benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:43.9336506Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9336602Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9336733Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9336977Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9338422Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9338515Z graph_break [] 2025-12-04T09:41:43.9338627Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9338808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9338901Z Autotune Choices Stats: 2025-12-04T09:41:43.9339745Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9339840Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9339971Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9340079Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9340564Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9341039Z triton_mm_1353 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9341543Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9342016Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9342481Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9342950Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9343424Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9343892Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9344361Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9344835Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9345173Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:43.9345307Z Autotune 
Choices Stats: 2025-12-04T09:41:43.9346167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9346261Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9346349Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9346456Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9346972Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9347456Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9347984Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9348452Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9348917Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9349381Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9349898Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:43.9350370Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9350878Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9351350Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9351674Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:43.9351856Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9351952Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9352088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9352340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9353281Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9353367Z graph_break [] 2025-12-04T09:41:43.9353472Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9353693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9353841Z Autotune Choices Stats: 2025-12-04T09:41:43.9355012Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.9355247Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9355370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9355516Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9356196Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9356934Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9357832Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9358574Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9359322Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9360148Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9360922Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9361855Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9362675Z 
triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9363580Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9364110Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:43.9364310Z Autotune Choices Stats: 2025-12-04T09:41:43.9365633Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9365798Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9365941Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9366109Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9366882Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9367671Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9368482Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9369277Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9370080Z triton_mm_1425 0.0287 ms 92.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9370984Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9371777Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9372543Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9373458Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9374250Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9374802Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:43.9374947Z Autotune Choices Stats: 2025-12-04T09:41:43.9376240Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9376405Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9376535Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9376815Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9377604Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9378401Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9379273Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9380050Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9380837Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9381642Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9382450Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9383264Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9384083Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9384930Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:43.9385491Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:43.9385940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9386093Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9386315Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9386721Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9388240Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9388389Z graph_break [] 2025-12-04T09:41:43.9388635Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9388933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9389091Z Autotune Choices Stats: 2025-12-04T09:41:43.9390336Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9390506Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9390646Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9390815Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9391600Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9392390Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9393266Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9394130Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9394914Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9395673Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9396440Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9397223Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9398072Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9398854Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9399430Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.9399685Z Autotune Choices Stats: 2025-12-04T09:41:43.9401277Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9401585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9401733Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9401889Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9402637Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9403431Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9404353Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9405169Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9405951Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9406736Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9407558Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9408443Z triton_mm_1518 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9409427Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9410353Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9410923Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:43.9411080Z Autotune Choices Stats: 2025-12-04T09:41:43.9412505Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9412673Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9412815Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9413000Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9413818Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9414618Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9415465Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9416331Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9417157Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9418108Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9418938Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9419775Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9420681Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9421478Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9422069Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:43.9422379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9422532Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9422738Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9423129Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9424652Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9424870Z graph_break [] 2025-12-04T09:41:43.9425036Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9425333Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9425492Z Autotune Choices Stats: 2025-12-04T09:41:43.9426969Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9427158Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9427318Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9427502Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9428319Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9429109Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9429938Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9430744Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9431579Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9432397Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9433156Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9434068Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9434883Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9435682Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9436347Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:43.9436509Z Autotune Choices Stats: 2025-12-04T09:41:43.9437943Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.9438100Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9438250Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9438400Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9439182Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9440090Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9440985Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9441691Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9442531Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9443311Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9444113Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9444883Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9445680Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9446501Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9447078Z SingleProcess 
AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:43.9447221Z Autotune Choices Stats: 2025-12-04T09:41:43.9448598Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9448891Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9449033Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9449217Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9450061Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9450858Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9451664Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9452583Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9453392Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9454183Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9455004Z triton_mm_1632 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9455831Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9456728Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9457555Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9458202Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:43.9458513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9458664Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9458876Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9459299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9461634Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9461804Z graph_break [] 2025-12-04T09:41:43.9461984Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9462276Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:41:43.9462433Z Autotune Choices Stats: 2025-12-04T09:41:43.9463820Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9464006Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9464145Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9464318Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9465242Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9466046Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9466860Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9467527Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9468230Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9468712Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9469186Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9469658Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9470140Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9470669Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9471006Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:43.9471101Z Autotune Choices Stats: 2025-12-04T09:41:43.9471988Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9472087Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9472182Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9472291Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9472781Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9473256Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9473730Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:43.9474203Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9474671Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9475146Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9475616Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9476131Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9476613Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9477083Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9477472Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:43.9477685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9477796Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9477937Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9478185Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:41:43.9479719Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9479853Z graph_break [] 2025-12-04T09:41:43.9479964Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9480148Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9480244Z Autotune Choices Stats: 2025-12-04T09:41:43.9481140Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.9481239Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9481331Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9481444Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9481929Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9482412Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9482892Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9483368Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9483845Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9484320Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9484810Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9485324Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9485794Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9486277Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9486615Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:43.9486717Z Autotune Choices Stats: 2025-12-04T09:41:43.9487642Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9487749Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9487836Z strides: [256, 1], 
[256, 1] 2025-12-04T09:41:43.9487941Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9488425Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9488899Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9489379Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9489923Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9490430Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9490909Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9491377Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9491863Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9492340Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9492822Z 
triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9493155Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.9493330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9493433Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9493566Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9493818Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9495205Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9495335Z graph_break [] 2025-12-04T09:41:43.9495446Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9495620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9495713Z Autotune Choices Stats: 2025-12-04T09:41:43.9496600Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9496702Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9496798Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9496905Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9497401Z 
triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9497936Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9498417Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9498907Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9499430Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9499947Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9500742Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9501215Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9501696Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9502163Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9502503Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:43.9502597Z Autotune Choices Stats: 2025-12-04T09:41:43.9503435Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9503531Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9503621Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9503740Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9504215Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9504791Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9505269Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9505736Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9506265Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9506735Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9507261Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9507734Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9508207Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9508688Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9509073Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:43.9509257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9509353Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9509539Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9509792Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9511167Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9511264Z graph_break [] 2025-12-04T09:41:43.9511368Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:43.9511546Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9511647Z Autotune Choices Stats: 2025-12-04T09:41:43.9512482Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9512584Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9512672Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9512778Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9513262Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9513746Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9514277Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9514758Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9515243Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9515761Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9516234Z triton_mm_1782 0.0287 
ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9516719Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9517186Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9517660Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9518031Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.9518128Z Autotune Choices Stats: 2025-12-04T09:41:43.9518967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9519105Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9519205Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9519312Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9519848Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9520331Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9520807Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9521290Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9521758Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9522229Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9522700Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9523175Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9523702Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9524177Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9524513Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:43.9524691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9524790Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9524994Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:43.9525242Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9526196Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9526286Z graph_break [] 2025-12-04T09:41:43.9526391Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9526571Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9526664Z Autotune Choices Stats: 2025-12-04T09:41:43.9527583Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.9527719Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9527806Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9527921Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9528408Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9528921Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9529390Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9529868Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9530345Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9530827Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9531302Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9531775Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9532252Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9532724Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9533095Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:43.9533202Z Autotune Choices Stats: 2025-12-04T09:41:43.9534022Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9534125Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9534217Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.9534374Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9534859Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9535349Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9535822Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9536289Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9536759Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9537320Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9537787Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9538304Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9538778Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9539260Z triton_mm_1865 
0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9539591Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:43.9539690Z Autotune Choices Stats: 2025-12-04T09:41:43.9540524Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9540620Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9540712Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9540825Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9541297Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9541789Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9542301Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9542781Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9543266Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9549684Z triton_mm_1890 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9550223Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9550723Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9551217Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9551696Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9552058Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:43.9552398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9552548Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9552753Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9553124Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9555375Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:43.9555509Z graph_break [] 2025-12-04T09:41:43.9555659Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9555897Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9555998Z Autotune Choices Stats: 2025-12-04T09:41:43.9556863Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:43.9556967Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9557075Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9557212Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9557729Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9558218Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9558702Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9559232Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9559809Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9560285Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9560813Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9561298Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9561799Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9562274Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9562617Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:43.9562723Z Autotune Choices Stats: 2025-12-04T09:41:43.9563573Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9563726Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9563821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9563931Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9564463Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9564942Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9565425Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9565903Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9566377Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9566859Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9567386Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9567875Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9568355Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9568903Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9569239Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:43.9569420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9569525Z 
frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9569660Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9569906Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9571364Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9571456Z graph_break [] 2025-12-04T09:41:43.9571572Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9571751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9571851Z Autotune Choices Stats: 2025-12-04T09:41:43.9572697Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9572839Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9572934Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9573044Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9573534Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9574065Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9574546Z triton_mm_1960 0.0278 ms 99.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9575030Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9575521Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9576000Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9576480Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9576960Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9577448Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9577922Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9578307Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:43.9578403Z Autotune Choices Stats: 2025-12-04T09:41:43.9579267Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9579364Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9579455Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9579569Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9580091Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9580571Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9581049Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9581522Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9581999Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9582510Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9582986Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9583505Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:43.9583989Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9584466Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9584803Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:43.9584987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9585090Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9585230Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9585488Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9586876Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9586971Z graph_break [] 2025-12-04T09:41:43.9587100Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9587315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9587454Z Autotune Choices Stats: 2025-12-04T09:41:43.9588310Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:43.9588412Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9588501Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9588609Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9589095Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9589620Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9590105Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9590590Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9591076Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9591550Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9592066Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9592550Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9593063Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9593541Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9593875Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:43.9593979Z Autotune Choices Stats: 2025-12-04T09:41:43.9594823Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9594923Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9595018Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9595127Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9595611Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9596096Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9596570Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9597048Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9597586Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9598091Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9598562Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9599080Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9599631Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9600122Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9600770Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:43.9600947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9601044Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9601178Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9601522Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9602977Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9603065Z graph_break [] 2025-12-04T09:41:43.9603173Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9603348Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9603439Z Autotune Choices Stats: 2025-12-04T09:41:43.9604276Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9604372Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9604461Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9604573Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9605058Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9605533Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9606001Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9606475Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9606954Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9607613Z triton_mm_2047 
0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9608348Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9609010Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9609697Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9610364Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9610792Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:43.9610891Z Autotune Choices Stats: 2025-12-04T09:41:43.9611725Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9611825Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9611962Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9612070Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9612553Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9613028Z triton_mm_2072 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9613543Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9614009Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9614481Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9614956Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9615426Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9615902Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9616381Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9616866Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9617198Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:43.9617423Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9617539Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9617691Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9617948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9618880Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9618966Z graph_break [] 2025-12-04T09:41:43.9619115Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9619292Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9619388Z Autotune Choices Stats: 2025-12-04T09:41:43.9620223Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9620316Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9620406Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9620510Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9620989Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9621508Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9621982Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9622508Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9622989Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9623476Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9623951Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9624423Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9624900Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9625363Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9625694Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:43.9625786Z Autotune Choices Stats: 2025-12-04T09:41:43.9626624Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9626757Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9626843Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9626952Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9627439Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9627946Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9628448Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9628923Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9629397Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9629861Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9630330Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9630801Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9631321Z 
triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9631829Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9632156Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:43.9632257Z Autotune Choices Stats: 2025-12-04T09:41:43.9633080Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9633182Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9633268Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9633378Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9633862Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9634334Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9634808Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9635280Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9635762Z triton_mm_2146 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9636282Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9636760Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9637241Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9637763Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9638251Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9638580Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:43.9638754Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9638853Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9638983Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9639235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9640677Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9640836Z graph_break [] 2025-12-04T09:41:43.9640940Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9641113Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9641248Z Autotune Choices Stats: 2025-12-04T09:41:43.9642077Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9642170Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9642264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9642373Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9642852Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9643329Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9643796Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9644286Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9644760Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9645236Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9645750Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9646229Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9646708Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9647206Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9647545Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:43.9647661Z Autotune Choices Stats: 2025-12-04T09:41:43.9648537Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9648629Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9648714Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9648827Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9649302Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9649822Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9650293Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9650799Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9651271Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9651738Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9652210Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9652680Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9653161Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9653631Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9653965Z SingleProcess AUTOTUNE benchmarking 
takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:43.9654151Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9654245Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9654380Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9654663Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9655612Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9655704Z graph_break [] 2025-12-04T09:41:43.9655808Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9655991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9656084Z Autotune Choices Stats: 2025-12-04T09:41:43.9656954Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9657059Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9657145Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9657252Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9657787Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9658265Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.9658746Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9659267Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9659802Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9660280Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9660761Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9661235Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9661891Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9662565Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9662899Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:43.9662999Z Autotune Choices Stats: 2025-12-04T09:41:43.9664098Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9664196Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9664286Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9664429Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9665153Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9665679Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9666146Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9666667Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9667138Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9667610Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9668077Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9668558Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9669033Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9669543Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9669884Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:43.9669977Z Autotune Choices Stats: 2025-12-04T09:41:43.9670857Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:43.9670951Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9671037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9671156Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9671639Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9672112Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9672585Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9673056Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9673534Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9673999Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9674531Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9675012Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9675490Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9676023Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9676358Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:43.9676538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9676636Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9676773Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9677021Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9678446Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9678578Z graph_break [] 2025-12-04T09:41:43.9678681Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9678861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9678959Z Autotune Choices Stats: 2025-12-04T09:41:43.9679931Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9680032Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9680118Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9680222Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9680708Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9681207Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9681687Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9682170Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9682655Z triton_mm_2298 0.0287 ms 
96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9683129Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9683598Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9684112Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9684594Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9685079Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9685410Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:43.9685547Z Autotune Choices Stats: 2025-12-04T09:41:43.9686393Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9686489Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9686581Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9686687Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9687166Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9687681Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9688219Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9688696Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9689205Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9689682Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9690150Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9690648Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9691129Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9691611Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.9691956Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:43.9692134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9692241Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9692375Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9692624Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9694016Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9694151Z graph_break [] 2025-12-04T09:41:43.9694264Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9694440Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9694537Z Autotune Choices Stats: 2025-12-04T09:41:43.9695433Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9695532Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9695630Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9695738Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9696221Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:43.9696713Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9697196Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9697718Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9698191Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9698704Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9699192Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9699663Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9700155Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9700938Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9701286Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds 
precompiling for 15 choices 2025-12-04T09:41:43.9701379Z Autotune Choices Stats: 2025-12-04T09:41:43.9702228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9702334Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9702427Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9702541Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9703019Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9703575Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9704045Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9704510Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9705039Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9705505Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9705982Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9706454Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9706925Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9707507Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9707831Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:43.9708017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9708113Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9708293Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9708543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9709911Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9710004Z graph_break [] 2025-12-04T09:41:43.9710108Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9710285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9710381Z Autotune Choices Stats: 2025-12-04T09:41:43.9711212Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:43.9711311Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9711397Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9711504Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9711981Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9712449Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9712999Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9713557Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9714316Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9715184Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9715938Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9716622Z 
triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9717403Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9718066Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9718591Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:43.9718728Z Autotune Choices Stats: 2025-12-04T09:41:43.9720044Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9720249Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9720370Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9720505Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9721259Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9722014Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9722767Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9723429Z triton_mm_2417 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9723960Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9724432Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9724914Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9725392Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9725944Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9726419Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9726760Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:43.9726951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9727091Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9727229Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9727477Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9728434Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9728518Z graph_break [] 2025-12-04T09:41:43.9728624Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9728806Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9728939Z Autotune Choices Stats: 2025-12-04T09:41:43.9729869Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9730026Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9730115Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9730229Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9730741Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9731214Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9731695Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9732173Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9732656Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9733139Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9733618Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9734081Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9734552Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9735064Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9735393Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:43.9735492Z Autotune Choices Stats: 2025-12-04T09:41:43.9736316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9736417Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9736546Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9736653Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9737136Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9737633Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9738124Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9738601Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9739065Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9739588Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9740103Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9740591Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9741061Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9741537Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9741871Z SingleProcess AUTOTUNE 
benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:43.9741964Z Autotune Choices Stats: 2025-12-04T09:41:43.9742796Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9742893Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9742983Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9743101Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9743572Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9744046Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9744554Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9745030Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9745510Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9746013Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9746494Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9746968Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9747450Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9747964Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9748309Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:43.9748548Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9748644Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9748785Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9749028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9750032Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9750123Z graph_break [] 2025-12-04T09:41:43.9750227Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9750407Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9750502Z Autotune Choices Stats: 2025-12-04T09:41:43.9751335Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:43.9751438Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9751524Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9751635Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9752121Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9752591Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9753063Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9753536Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9754065Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9754536Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9755008Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9755514Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9755980Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9756459Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9756789Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:43.9756886Z Autotune Choices Stats: 2025-12-04T09:41:43.9757711Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9757844Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9757936Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9758043Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9758519Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9759035Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9759832Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9760581Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9761236Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9761806Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9762456Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9763103Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9763729Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9764384Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9764972Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:43.9765110Z Autotune Choices Stats: 2025-12-04T09:41:43.9766244Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9766383Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9766506Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9766671Z dtypes: torch.bfloat16, 
torch.bfloat16 2025-12-04T09:41:43.9767315Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9767803Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9768283Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9768760Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9769253Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9769768Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9770251Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9770767Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9771252Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9771724Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9772057Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:43.9772241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9772337Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9772477Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9772727Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9774112Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9774220Z graph_break [] 2025-12-04T09:41:43.9774370Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9774634Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9774834Z Autotune Choices Stats: 2025-12-04T09:41:43.9775823Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.9775926Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9776013Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9776127Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9776650Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9777207Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9777822Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9778386Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9778948Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9779514Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9780208Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9780838Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9781356Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9781841Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:43.9782186Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:43.9782287Z Autotune Choices Stats: 2025-12-04T09:41:43.9783117Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9783213Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9783308Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9783415Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9783899Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9784369Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9784850Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9785372Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9785844Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9786315Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.9786817Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9787300Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9787774Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9788247Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9788584Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:43.9788759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9788860Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9789066Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9789319Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9790278Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9790364Z graph_break [] 2025-12-04T09:41:43.9790525Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9790705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9790800Z Autotune Choices Stats: 2025-12-04T09:41:43.9791639Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9791735Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9791823Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9791936Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9792418Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9792908Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:43.9793391Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9793874Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9794345Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9794854Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9795332Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9795799Z triton_mm_2646 0.0287 ms 
96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9796325Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9796813Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9797155Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:43.9797248Z Autotune Choices Stats: 2025-12-04T09:41:43.9798134Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:43.9798237Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9798326Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9798481Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9798977Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9799451Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9800034Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9801624Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9802108Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9802574Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9803045Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9803526Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9803998Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9804476Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9804806Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:43.9805008Z Autotune Choices Stats: 2025-12-04T09:41:43.9805834Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9805928Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9806022Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9806136Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9806608Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9807147Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9807641Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9808141Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9808610Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9809096Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9809627Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9810118Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9810654Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.9811121Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9811457Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:43.9811637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9811742Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9811876Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9812125Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9813514Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9813600Z graph_break [] 2025-12-04T09:41:43.9813713Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9813891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9813990Z Autotune Choices Stats: 2025-12-04T09:41:43.9814842Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9814981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9815070Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9815184Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.9815675Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9816149Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9816660Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9817146Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9817622Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9818097Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9818578Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9819088Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9819564Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9820068Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9820402Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:43.9820500Z Autotune Choices Stats: 2025-12-04T09:41:43.9821334Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9821439Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9821530Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9821635Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9822125Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9822598Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9823077Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9823557Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9824031Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9824541Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9825004Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9825482Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9826016Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9826494Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9826825Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:43.9827004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9827100Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9827232Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9827482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9828473Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9828607Z graph_break [] 2025-12-04T09:41:43.9828712Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9828886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9828986Z 
Autotune Choices Stats: 2025-12-04T09:41:43.9829857Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9829952Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9830046Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9830152Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9830642Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9831116Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9831587Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9832058Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9832524Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9833005Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9833511Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:43.9833991Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9834456Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9834957Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9835301Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:43.9835397Z Autotune Choices Stats: 2025-12-04T09:41:43.9836239Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9836334Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9836421Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9836533Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9837008Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9842863Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9843380Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:43.9843926Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9844401Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9844866Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9845345Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9845819Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9846300Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9846774Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9847106Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:43.9847208Z Autotune Choices Stats: 2025-12-04T09:41:43.9848048Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 
2025-12-04T09:41:43.9848196Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9848287Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9848401Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9848891Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9849362Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9849876Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9850353Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9850832Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9851319Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9851796Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9852316Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9852783Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9853302Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9853633Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:43.9853813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9853916Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9854049Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9854308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9855684Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9855773Z graph_break [] 2025-12-04T09:41:43.9855891Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9856070Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9856173Z Autotune Choices Stats: 2025-12-04T09:41:43.9857013Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9857113Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9857208Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.9857361Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9857900Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9858381Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9858856Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9859401Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9859878Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9860362Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9860833Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9861305Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9861819Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9862298Z triton_mm_2862 
0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9862638Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:43.9862772Z Autotune Choices Stats: 2025-12-04T09:41:43.9863617Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9863712Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9863799Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9863921Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9864401Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9864878Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9865348Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9865815Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9866285Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9866750Z triton_mm_2892 0.0307 ms 90.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9867291Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9867768Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9868296Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9868807Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9869144Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:43.9869329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9869425Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9869562Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9869810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9870750Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9870836Z graph_break [] 2025-12-04T09:41:43.9870984Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9871165Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:43.9871261Z Autotune Choices Stats: 2025-12-04T09:41:43.9872154Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:43.9872255Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9872345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9872454Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9872940Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9873408Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9873892Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9874371Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9874846Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9875325Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9875804Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9876281Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9876801Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9877271Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9877604Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:43.9877703Z Autotune Choices Stats: 2025-12-04T09:41:43.9878572Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9878673Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9878767Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9878873Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9879349Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9879917Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9880474Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9881079Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9881634Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9882225Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9882777Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9883337Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9883900Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9884459Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9884852Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:43.9884949Z Autotune Choices Stats: 2025-12-04T09:41:43.9885954Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:43.9886055Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9886148Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9886270Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9886877Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9887437Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9887991Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9888583Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9889147Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9889707Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9890279Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9890834Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9891391Z triton_mm_2959 0.0277 ms 96.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9891980Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9892365Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:43.9892603Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9892706Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9892847Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9893132Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9894287Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9894384Z graph_break [] 2025-12-04T09:41:43.9894499Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9894699Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9894803Z Autotune Choices Stats: 2025-12-04T09:41:43.9895808Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9895912Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9896004Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9896118Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:43.9896696Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9897264Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9897918Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9898469Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9899023Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9899619Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9900172Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9900985Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9901461Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9901937Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9902359Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:43.9902456Z Autotune Choices Stats: 2025-12-04T09:41:43.9903305Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9903487Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9903581Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9903688Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9904163Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9904643Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9905119Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9905598Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9906067Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9906539Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9907006Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9907480Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9908016Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9908502Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9908843Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:43.9908937Z Autotune Choices Stats: 2025-12-04T09:41:43.9909832Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:43.9909938Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9910027Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9910142Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9910628Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9911094Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9911568Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9912076Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9912550Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9913052Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9913532Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9914003Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9914486Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9914966Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9915300Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:43.9915480Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9915580Z 
frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9915713Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9915962Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9917362Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9917510Z graph_break [] 2025-12-04T09:41:43.9917626Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9917800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9917895Z Autotune Choices Stats: 2025-12-04T09:41:43.9918761Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9918866Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9918954Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9919062Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9919600Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9920159Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9920725Z triton_mm_3081 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9921288Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9921879Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9922438Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9923074Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9923544Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9924009Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9924484Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9924817Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:43.9924906Z Autotune Choices Stats: 2025-12-04T09:41:43.9925742Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:43.9925833Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9925919Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9926021Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9926501Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9926975Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9927541Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9928007Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9928472Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9928981Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9929450Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9929923Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.9930393Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9930864Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9931230Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:43.9931402Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9931496Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9931632Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9931911Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9932856Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9932940Z graph_break [] 2025-12-04T09:41:43.9933043Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9933222Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9933311Z Autotune Choices Stats: 2025-12-04T09:41:43.9934138Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9934233Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9934319Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:43.9934425Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9934894Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9935375Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9935848Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9936363Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9936848Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9937318Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9937822Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9938294Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9938764Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9939229Z 
triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9939559Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:43.9939654Z Autotune Choices Stats: 2025-12-04T09:41:43.9940477Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9941114Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9941200Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9941303Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9941820Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9942291Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9942765Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9943245Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9943711Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9944178Z triton_mm_3145 0.0317 ms 87.1% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9944643Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9945117Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9945590Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9946100Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9946427Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:43.9946518Z Autotune Choices Stats: 2025-12-04T09:41:43.9947363Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:43.9947458Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9947610Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9947733Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:43.9948220Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9948698Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9949165Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9949631Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9950140Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9950604Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9951110Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9951581Z triton_mm_3178 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9952053Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9952529Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9952862Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:43.9953034Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:43.9953127Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9953261Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9953506Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9954888Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9954973Z graph_break [] 2025-12-04T09:41:43.9955116Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9955289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9955378Z Autotune Choices Stats: 2025-12-04T09:41:43.9956214Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9956306Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9956391Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9956496Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9957024Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9957502Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:43.9958039Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9958519Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9958985Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9959541Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9960017Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9960527Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9961001Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9961478Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9961817Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:43.9961914Z Autotune Choices Stats: 2025-12-04T09:41:43.9962755Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:43.9962851Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9962937Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9963040Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9963524Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9964000Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9964479Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9964990Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9965457Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9965925Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9966433Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9966910Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9967388Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9967868Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9968197Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:43.9968368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9968505Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9968635Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9968879Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9970301Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9970384Z graph_break [] 2025-12-04T09:41:43.9970489Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9970661Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9970757Z Autotune Choices Stats: 2025-12-04T09:41:43.9971590Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9971685Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9971773Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9971877Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9972353Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9972842Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9973317Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9973796Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9974316Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9974792Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9975259Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9975799Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:43.9976279Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9976749Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9977087Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:43.9977198Z Autotune Choices Stats: 2025-12-04T09:41:43.9978079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:43.9978213Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9978296Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9978404Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9978885Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9979398Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9979865Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9980334Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9980804Z 
triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9981273Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9981743Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9982216Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9982695Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9983173Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9983542Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:43.9983718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:43.9983811Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:43.9983947Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:43.9984187Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:43.9985610Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:43.9985700Z graph_break [] 2025-12-04T09:41:43.9985804Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:43.9985982Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:43.9986078Z Autotune Choices Stats: 2025-12-04T09:41:43.9986923Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:43.9987028Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9987174Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9987305Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:43.9987802Z triton_mm_3300 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9988278Z triton_mm_3288 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9988807Z triton_mm_3289 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9989290Z triton_mm_3294 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9989786Z triton_mm_3295 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9990262Z triton_mm_3296 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9990745Z triton_mm_3287 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9991228Z triton_mm_3290 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9991699Z triton_mm_3291 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9992176Z triton_mm_3292 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9992507Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6016 seconds precompiling for 15 choices 2025-12-04T09:41:43.9992648Z Autotune Choices Stats: 2025-12-04T09:41:43.9993490Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028575999662280083, "best_triton_pos": 0} 2025-12-04T09:41:43.9993585Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:43.9993681Z strides: [256, 1], [256, 1] 2025-12-04T09:41:43.9993785Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:43.9994307Z triton_mm_3318 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9994790Z triton_mm_3319 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9995270Z triton_mm_3320 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:43.9995749Z triton_mm_3323 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:43.9996218Z triton_mm_3317 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:43.9996695Z triton_mm_3322 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9997206Z triton_mm_3321 0.0337 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:43.9997692Z triton_mm_3324 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9998212Z triton_mm_3326 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:43.9998690Z triton_mm_3327 0.0348 ms 82.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:43.9999031Z SingleProcess AUTOTUNE benchmarking 
takes 0.1794 seconds and 1.8359 seconds precompiling for 13 choices 2025-12-04T09:41:43.9999256Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:43.9999367Z Traceback (most recent call last): 2025-12-04T09:41:43.9999837Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.0000023Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:44.0000675Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:44.0000865Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:44.0001037Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.0001133Z Searched string: 2025-12-04T09:41:44.0001266Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.0001273Z 2025-12-04T09:41:44.0001399Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.0001407Z 2025-12-04T09:41:44.0001541Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.0001667Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.0001672Z 2025-12-04T09:41:44.0001859Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.0001949Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.0002040Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.0002135Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.0002140Z 2025-12-04T09:41:44.0002230Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.0002335Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.0002427Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.0002516Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.0002520Z 2025-12-04T09:41:44.0002524Z 2025-12-04T09:41:44.0002686Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.0002691Z 2025-12-04T09:41:44.0002695Z 2025-12-04T09:41:44.0002817Z # rematerialize rm and rn to save registers 
2025-12-04T09:41:44.0002997Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.0003111Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.0003198Z idx_m = rm[:, None] 2025-12-04T09:41:44.0003292Z idx_n = rn[None, :] 2025-12-04T09:41:44.0003385Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.0003389Z 2025-12-04T09:41:44.0003486Z # inductor generates a suffix 2025-12-04T09:41:44.0003580Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.0003794Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.0003882Z ''', device_str='cuda') 2025-12-04T09:41:44.0003896Z 2025-12-04T09:41:44.0003900Z 2025-12-04T09:41:44.0004000Z async_compile.wait(globals()) 2025-12-04T09:41:44.0004082Z del async_compile 2025-12-04T09:41:44.0004086Z 2025-12-04T09:41:44.0004171Z class Runner: 2025-12-04T09:41:44.0004272Z def __init__(self, partitions): 2025-12-04T09:41:44.0004432Z self.partitions = partitions 2025-12-04T09:41:44.0004440Z 2025-12-04T09:41:44.0004556Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.0004645Z new_callables = [] 2025-12-04T09:41:44.0004776Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.0004884Z new_callables.append(fn(c)) 2025-12-04T09:41:44.0004986Z self.partitions = new_callables 2025-12-04T09:41:44.0004991Z 2025-12-04T09:41:44.0005083Z def call(self, args): 2025-12-04T09:41:44.0005226Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.0005309Z args.clear() 2025-12-04T09:41:44.0005442Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.0005566Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.0005672Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.0005777Z torch.cuda.set_device(0) 2025-12-04T09:41:44.0005946Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.0006176Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.0006275Z stream0 = get_raw_stream(0) 
2025-12-04T09:41:44.0006469Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.0006561Z del arg0_1 2025-12-04T09:41:44.0006724Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.0006981Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.0007084Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.0007301Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.0007388Z del arg1_1 2025-12-04T09:41:44.0007470Z del buf0 2025-12-04T09:41:44.0007569Z return (buf1, ) 2025-12-04T09:41:44.0007575Z 2025-12-04T09:41:44.0007695Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.0007801Z call = runner.call 2025-12-04T09:41:44.0007966Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.0007971Z 2025-12-04T09:41:44.0007974Z 2025-12-04T09:41:44.0008123Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.0008329Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.0008484Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.0008686Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.0008893Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.0009007Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.0009170Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.0009175Z 2025-12-04T09:41:44.0009179Z 2025-12-04T09:41:44.0009269Z if __name__ == "__main__": 2025-12-04T09:41:44.0009483Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.0009694Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.0009826Z From CHECK: .to( 2025-12-04T09:41:44.0009833Z 2025-12-04T09:41:44.0009839Z 2025-12-04T09:41:44.0010077Z To execute this test, run the following from the 
base repo dir: 2025-12-04T09:41:44.0010852Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.0010870Z 2025-12-04T09:41:44.0011178Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.0011431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0011567Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0011744Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0012085Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0013834Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0013929Z graph_break [] 2025-12-04T09:41:44.0014104Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0014283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0014376Z Autotune Choices Stats: 2025-12-04T09:41:44.0015219Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0015320Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0015412Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0015520Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.0016005Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0016478Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0016946Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0017464Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0017926Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0018442Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0018909Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0019365Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0019863Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0020332Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0020670Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.0020766Z Autotune Choices Stats: 2025-12-04T09:41:44.0021612Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0021712Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0021799Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0021911Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0022391Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0022890Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0023397Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0023858Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0024321Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0024783Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0025246Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0025713Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0026180Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0026651Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0026984Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:44.0027162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0027298Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0027428Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0027675Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0028614Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0028705Z graph_break [] 2025-12-04T09:41:44.0028810Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0028984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0029126Z Autotune Choices Stats: 
2025-12-04T09:41:44.0029948Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0030054Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0030139Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0030246Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0030726Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0031198Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0031720Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0032203Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0032719Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0033209Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0033674Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.0034153Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0034621Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0035085Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0035422Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:44.0035514Z Autotune Choices Stats: 2025-12-04T09:41:44.0036339Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0036439Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0036525Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0036675Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0037145Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0037638Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0038131Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0038632Z triton_mm_823 
0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0039094Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0039628Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0040097Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0040565Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0041086Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0041556Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0041893Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:44.0042050Z Autotune Choices Stats: 2025-12-04T09:41:44.0042879Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.0042983Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.0043069Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0043186Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0043670Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0044132Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0044602Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0045069Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0045546Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0046018Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0046528Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0047009Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0047521Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.0048026Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0048366Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:44.0048552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0048647Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0048778Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0049033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0049972Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0050056Z graph_break [] 2025-12-04T09:41:44.0050208Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0050384Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0050482Z Autotune Choices Stats: 2025-12-04T09:41:44.0051352Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.0051448Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0051540Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0051650Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0052135Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0052607Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0053069Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0053541Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0054009Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0054481Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0054953Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0055433Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0055937Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0056396Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0056733Z SingleProcess AUTOTUNE 
benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:44.0056824Z Autotune Choices Stats: 2025-12-04T09:41:44.0057695Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:44.0057794Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0057880Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0057992Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0058460Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0058930Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0059395Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0059896Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0060363Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0060862Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0061326Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0061798Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0062276Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0062749Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0063079Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:44.0063176Z Autotune Choices Stats: 2025-12-04T09:41:44.0064006Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0064114Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0064202Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0064314Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0064834Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0065305Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0065786Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0066251Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0066751Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0067226Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0067694Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0068206Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0068666Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0069232Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0069565Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:44.0069738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0069880Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0070013Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0070269Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0071645Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0071732Z graph_break [] 2025-12-04T09:41:44.0071843Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0072016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0072113Z Autotune Choices Stats: 2025-12-04T09:41:44.0072957Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0073050Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0073142Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0073249Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0073732Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0074240Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0074704Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0075167Z 
triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0075627Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0076152Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0076631Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0077115Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0077606Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0078097Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0078500Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:44.0078594Z Autotune Choices Stats: 2025-12-04T09:41:44.0079546Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0079645Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.0079731Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0079841Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0080316Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0080795Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0081270Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0081742Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0082212Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0082673Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0083153Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0083628Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0084159Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:44.0084630Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0084958Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:44.0085176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0085277Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0085416Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0085664Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0086608Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0086698Z graph_break [] 2025-12-04T09:41:44.0086802Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0086974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0087078Z Autotune Choices Stats: 2025-12-04T09:41:44.0087930Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.0088074Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0088161Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0088266Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0088792Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0089269Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0089744Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0090220Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0090696Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0091167Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0091633Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0092114Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0092589Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0093115Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0093447Z 
SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:44.0093540Z Autotune Choices Stats: 2025-12-04T09:41:44.0094373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0094508Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0094602Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0094706Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0095177Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0095660Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0096133Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0096608Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0097119Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0097621Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0098145Z triton_mm_1042 0.0328 ms 84.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0098619Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0099098Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0099574Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0099908Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:44.0100000Z Autotune Choices Stats: 2025-12-04T09:41:44.0101198Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0101298Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0101384Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0101502Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0101986Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0102463Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0103027Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0103496Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0103973Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0104500Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0104981Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0105456Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0105926Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0106404Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0106789Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:44.0106969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0107078Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0107228Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:44.0107497Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0108493Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0108584Z graph_break [] 2025-12-04T09:41:44.0108689Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0113230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0113359Z Autotune Choices Stats: 2025-12-04T09:41:44.0114205Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0114317Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0114408Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0114520Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0115008Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0115489Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0115979Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0116458Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0117014Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0117523Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0118022Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0118582Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0119056Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0119605Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0119940Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.0120036Z Autotune Choices Stats: 2025-12-04T09:41:44.0120877Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0121018Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0121115Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.0121225Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0121741Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0122225Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0122699Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0123184Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0123658Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0124137Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0124610Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0125091Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0125574Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0126067Z triton_mm_1131 
0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0126609Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:44.0126745Z Autotune Choices Stats: 2025-12-04T09:41:44.0127827Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0127928Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0128017Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0128184Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0128664Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0129141Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0129627Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0130108Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0130594Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0131116Z triton_mm_1160 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0131703Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0132174Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0132650Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0133130Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0133463Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:44.0133648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0133745Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0133878Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0134129Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0135529Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0135625Z 
graph_break [] 2025-12-04T09:41:44.0135729Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0135952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0136051Z Autotune Choices Stats: 2025-12-04T09:41:44.0136891Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0136992Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0137079Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0137185Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0137757Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0138238Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0138731Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0139245Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0139726Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0140217Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0140739Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0141270Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0141752Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0142229Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0142580Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:44.0142716Z Autotune Choices Stats: 2025-12-04T09:41:44.0143891Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0144018Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0144112Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0144217Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0144697Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0145181Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.0145654Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0146190Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0146668Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0147150Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0147657Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0148140Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0148627Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0149103Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0149438Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:44.0149613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0149752Z frames [('total', 1), 
('ok', 1)] 2025-12-04T09:41:44.0149894Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0150140Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0151572Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0151660Z graph_break [] 2025-12-04T09:41:44.0151764Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0151943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0152034Z Autotune Choices Stats: 2025-12-04T09:41:44.0152898Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0152994Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0153082Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0153192Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0153683Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0154165Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0154639Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0155112Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0155654Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0156125Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0156606Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0157195Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0157713Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0158190Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0158524Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:44.0158621Z Autotune Choices Stats: 2025-12-04T09:41:44.0159532Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0159685Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0159772Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0159883Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0160368Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0160880Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0161359Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0161835Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0162310Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0162781Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0163260Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0163738Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.0164215Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0164707Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0165077Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:44.0165258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0165354Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0165490Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0165738Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0167183Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0167292Z graph_break [] 2025-12-04T09:41:44.0167411Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0167584Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0167683Z Autotune Choices Stats: 2025-12-04T09:41:44.0168521Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, 
"best_triton_pos": 0} 2025-12-04T09:41:44.0168621Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0168707Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0168855Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0169339Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0169816Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0170332Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0170810Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0171285Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0171771Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0172250Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0172731Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0173199Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0173677Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0174014Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:44.0174147Z Autotune Choices Stats: 2025-12-04T09:41:44.0175003Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0175098Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0175187Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0175291Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0175765Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0176293Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0176770Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0177252Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0177722Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0178199Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0178713Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0179190Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0179706Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0180182Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0180515Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:44.0180693Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0180789Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0180926Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0181175Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0182555Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0182638Z graph_break [] 2025-12-04T09:41:44.0182744Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0182926Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0183019Z Autotune Choices Stats: 2025-12-04T09:41:44.0183861Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0183994Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0184083Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0184192Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0184668Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0185148Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0185662Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0186149Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0186635Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0187114Z 
triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0187605Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0188177Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0188660Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0189196Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0189528Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:44.0189627Z Autotune Choices Stats: 2025-12-04T09:41:44.0190463Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.0190563Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0190652Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0190755Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0191237Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0191706Z triton_mm_1342 0.0276 ms 99.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0192178Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0192658Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0193126Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0193645Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0194116Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0194597Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0195112Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0195596Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0195932Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:44.0196105Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0196203Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0196335Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0196586Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0198020Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0198145Z graph_break [] 2025-12-04T09:41:44.0198255Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0198467Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0198567Z Autotune Choices Stats: 2025-12-04T09:41:44.0199415Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0199588Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0199700Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0199804Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0200633Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0201282Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0201755Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0202231Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0202703Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0203273Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0203747Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0204223Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0204695Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0205229Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0205573Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:44.0205668Z Autotune Choices Stats: 2025-12-04T09:41:44.0206515Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0206609Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0206697Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0206809Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0207290Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0207829Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0208354Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0208820Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0209292Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0209762Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0210232Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0210708Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0211183Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0211651Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0211985Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:44.0212162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0212302Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0212437Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0212682Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0213626Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0213711Z graph_break [] 2025-12-04T09:41:44.0213813Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0213991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0214123Z Autotune Choices Stats: 2025-12-04T09:41:44.0214963Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 
2025-12-04T09:41:44.0215067Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0215153Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0215260Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0215740Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0216206Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0216718Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0217187Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0217732Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0218228Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0218701Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0219188Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0219664Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0220139Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0220467Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:44.0220564Z Autotune Choices Stats: 2025-12-04T09:41:44.0221396Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0221492Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0221626Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0221734Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0222208Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0222684Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0223149Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0223715Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0224183Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0224658Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0225123Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0225593Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0226140Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0226614Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0226988Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:44.0227081Z Autotune Choices Stats: 2025-12-04T09:41:44.0227989Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0228082Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0228171Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0228289Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0228766Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:44.0229246Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0229718Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0230194Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0230676Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0231153Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0231679Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0232155Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0232640Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0233161Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0233492Z SingleProcess AUTOTUNE benchmarking takes 
0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:44.0233672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0233766Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0233902Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0234143Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0235086Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0235216Z graph_break [] 2025-12-04T09:41:44.0235320Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0235493Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0235592Z Autotune Choices Stats: 2025-12-04T09:41:44.0236454Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0236553Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0236637Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0236740Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0237218Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0237698Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.0238176Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0238663Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0239136Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0239690Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0240162Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0240679Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0241146Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0241615Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0241941Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.0242037Z Autotune Choices Stats: 2025-12-04T09:41:44.0242910Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0243013Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0243102Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0243208Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0243679Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0244153Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0244620Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0245131Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0245634Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0246103Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0246567Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0247041Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0247523Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0248043Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0248377Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:44.0248470Z Autotune Choices Stats: 2025-12-04T09:41:44.0249303Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0249403Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0249489Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0249642Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0250113Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0250587Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0251066Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0251579Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0252070Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0252564Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0253054Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0253529Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0254037Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0254509Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0254839Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:44.0255060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0255161Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0255294Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0255544Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0256488Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0256580Z graph_break [] 2025-12-04T09:41:44.0256686Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0256859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0256958Z Autotune Choices Stats: 2025-12-04T09:41:44.0257820Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0257938Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0258032Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0258136Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0258627Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0259106Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0259629Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0260109Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0260592Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.0261141Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0261611Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0262090Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0262554Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0263025Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0263396Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.0263490Z Autotune Choices Stats: 2025-12-04T09:41:44.0264359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.0264455Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0264546Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0264648Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0265122Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.0265601Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0266070Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0266549Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0267016Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0267480Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0267958Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0268431Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0268958Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0269430Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0269761Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds 
precompiling for 13 choices 2025-12-04T09:41:44.0269856Z Autotune Choices Stats: 2025-12-04T09:41:44.0270731Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0270833Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0270919Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0271038Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0271515Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0271983Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0272459Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0272968Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0273442Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0273945Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0274423Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0274898Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0275370Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0275855Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0276179Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:44.0276357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0276452Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0276584Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0276837Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0278254Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0278389Z graph_break [] 2025-12-04T09:41:44.0278493Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0278666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0278763Z Autotune Choices Stats: 2025-12-04T09:41:44.0279680Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0279787Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0279875Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0279982Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0280463Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0280943Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0281433Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0281909Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0282419Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0282893Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0283408Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0283884Z 
triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0284358Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0284844Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0285178Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:44.0285273Z Autotune Choices Stats: 2025-12-04T09:41:44.0286137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0286230Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0286325Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0286434Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0286913Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0287432Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0287962Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0288443Z triton_mm_1689 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0288949Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0289424Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0289901Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0290376Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0290857Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0291332Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0291707Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:44.0291884Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0291980Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0292118Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0292405Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0293789Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0293874Z graph_break [] 2025-12-04T09:41:44.0293979Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0294160Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0294255Z Autotune Choices Stats: 2025-12-04T09:41:44.0295107Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.0295202Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0295291Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0295400Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0295878Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0296362Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0296925Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0297396Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0297870Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0298383Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0298866Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0299340Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0299817Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0300611Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0300963Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:44.0301147Z Autotune Choices Stats: 2025-12-04T09:41:44.0301987Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0302090Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0302230Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0302336Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:44.0302812Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0303278Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0303756Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0304230Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0304699Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0305171Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0305639Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0306119Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0306645Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0307125Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0307505Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.0307680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0307780Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0307914Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0308218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0309919Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0310066Z graph_break [] 2025-12-04T09:41:44.0310219Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0310414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0310516Z Autotune Choices Stats: 2025-12-04T09:41:44.0311356Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0311521Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0311608Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0311714Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0312238Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0312719Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0313199Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0313678Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0314170Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0314646Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0315119Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0315596Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0316068Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0316579Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:44.0316914Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:44.0317011Z Autotune Choices Stats: 2025-12-04T09:41:44.0317906Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0318043Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0318137Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0318243Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0318719Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0319199Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0319734Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0320206Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0320721Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0321193Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.0321699Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0322172Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0322652Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0323128Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0323466Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:44.0323638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0323733Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0323872Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0324118Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0325500Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0325588Z graph_break [] 2025-12-04T09:41:44.0325743Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0325915Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:44.0326007Z Autotune Choices Stats: 2025-12-04T09:41:44.0326848Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0326940Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0327028Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0327140Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0327663Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0328153Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0328638Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0329120Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0329598Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0330118Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0330596Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0331135Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0331609Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0332076Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0332418Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.0332511Z Autotune Choices Stats: 2025-12-04T09:41:44.0333354Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0333453Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0333538Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0333642Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0334125Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0334602Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0335083Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0335598Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0336075Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0336538Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0337043Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0337523Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0338050Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0338528Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0338853Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:44.0339031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0339172Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0339303Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0339556Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0340537Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0340623Z graph_break [] 2025-12-04T09:41:44.0340733Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0340905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0341001Z Autotune Choices Stats: 2025-12-04T09:41:44.0341846Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.0341943Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0342035Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0342139Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0342633Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0343102Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0343570Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0344057Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0344590Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0345266Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0346429Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0347517Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0348614Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0349659Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0350568Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:44.0351082Z Autotune Choices Stats: 2025-12-04T09:41:44.0352066Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0353333Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0353613Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0353889Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:44.0354669Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0355966Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0357213Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0358437Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0359720Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0360984Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0362251Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0363469Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0365208Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0367015Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0368652Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:44.0369545Z Autotune Choices Stats: 2025-12-04T09:41:44.0371042Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0372781Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0373216Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0373657Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0374848Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0376509Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0378271Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0379562Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0380618Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0381671Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0382835Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0390568Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0391651Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0392696Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0393609Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:44.0394222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0394604Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0394919Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0395398Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0397140Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0398740Z graph_break [] 2025-12-04T09:41:44.0398986Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0399367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0399865Z Autotune Choices Stats: 2025-12-04T09:41:44.0401227Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.0402378Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0402639Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0402905Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0403598Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0404738Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0405807Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0406860Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0407961Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0409012Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.0410178Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0411229Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0412337Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0413408Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0414333Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:44.0414868Z Autotune Choices Stats: 2025-12-04T09:41:44.0415874Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0416910Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0417174Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0417443Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0418120Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0419180Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0420235Z 
triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0421285Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0422430Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0423465Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0424503Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0425584Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0426649Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0427721Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0428661Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:44.0429270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0429657Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0429971Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0430456Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0432236Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0433824Z graph_break [] 2025-12-04T09:41:44.0434069Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0434451Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0434832Z Autotune Choices Stats: 2025-12-04T09:41:44.0435844Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0436887Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0437152Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0437442Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0438161Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0439227Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0440349Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0441408Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0442463Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0443562Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0444597Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0445637Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0446711Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0447752Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0448737Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:44.0449513Z Autotune Choices Stats: 2025-12-04T09:41:44.0450960Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0452591Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0452995Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0453505Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0454631Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0456365Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0458172Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0459672Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0460737Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0462040Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0463129Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0464344Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0465513Z triton_mm_1993 0.0347 ms 79.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0466736Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0467815Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:44.0468629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0469159Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0469487Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0469973Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0472057Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0473930Z graph_break [] 2025-12-04T09:41:44.0474168Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0474547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0475039Z Autotune Choices Stats: 2025-12-04T09:41:44.0476208Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0477235Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.0477600Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0477972Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0478731Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0480041Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0481318Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0482598Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0483814Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0485158Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0486206Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0487552Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0489015Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0490376Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0491578Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:44.0492231Z Autotune Choices Stats: 2025-12-04T09:41:44.0493234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0494342Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0494598Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0494871Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0495555Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0496617Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0497802Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0498858Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0499904Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:44.0501247Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0502279Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0503433Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0504482Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0505597Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0506505Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:44.0507103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0507486Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0507802Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0508279Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0510004Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0511558Z graph_break [] 2025-12-04T09:41:44.0511792Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0512173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0512542Z Autotune Choices Stats: 2025-12-04T09:41:44.0513540Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0514818Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0515178Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0515515Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0516204Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0517299Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0518346Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0519547Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0520593Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0521639Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0522686Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0523740Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0524843Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0525905Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0526872Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:44.0527398Z Autotune Choices Stats: 2025-12-04T09:41:44.0528429Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0529456Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0529718Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0529976Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0530652Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0531718Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0532766Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0533817Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0534853Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0535886Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0536969Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0538003Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0539045Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0540127Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0541041Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:44.0541642Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:44.0542023Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0542326Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0542799Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0544090Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0545253Z graph_break [] 2025-12-04T09:41:44.0545480Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0545856Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0546237Z Autotune Choices Stats: 2025-12-04T09:41:44.0547259Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0548329Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0548586Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0548851Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0549520Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0550590Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0551645Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0552696Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0553749Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0554814Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0555878Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0556965Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0558054Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0559097Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0560085Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:44.0560652Z Autotune Choices Stats: 2025-12-04T09:41:44.0561643Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0562659Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0562913Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0563183Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0563858Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0564901Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0566006Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0567054Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0568130Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0569170Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0570194Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0571257Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0572319Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0573369Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0574271Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:44.0574791Z Autotune Choices Stats: 2025-12-04T09:41:44.0575770Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0576793Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0577147Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0577436Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0578119Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0579177Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0580229Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0581316Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0582376Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0583437Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0584496Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0585552Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0586675Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0587768Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0588750Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:44.0589354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0589729Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0590039Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0590520Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0592263Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), 
('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0593811Z graph_break [] 2025-12-04T09:41:44.0594043Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0594424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0594800Z Autotune Choices Stats: 2025-12-04T09:41:44.0595778Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0596797Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0597059Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0597328Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0597993Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0599102Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0600224Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0601565Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0602696Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.0603748Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0604808Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0605855Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0606901Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0608061Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0608963Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:44.0609486Z Autotune Choices Stats: 2025-12-04T09:41:44.0610533Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0611557Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0611821Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0612084Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0612765Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0613824Z 
triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0614881Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0615934Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0616992Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0618089Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0619125Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0620238Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0621286Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0622343Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0623295Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 
2025-12-04T09:41:44.0623896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0624281Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0624589Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0625064Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0626364Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0627490Z graph_break [] 2025-12-04T09:41:44.0627726Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0628098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0628523Z Autotune Choices Stats: 2025-12-04T09:41:44.0629510Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0630533Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0630790Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0631096Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0631770Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0632826Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0633884Z triton_mm_2214 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0634934Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0636006Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0637086Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0638216Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0639276Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0640454Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0641513Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0642423Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:44.0642937Z Autotune Choices Stats: 2025-12-04T09:41:44.0643985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0645011Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0645272Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0645531Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0646207Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0647257Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0648449Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0649491Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0650603Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0651719Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0652757Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0653800Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.0654848Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0655892Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0656818Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.0657345Z Autotune Choices Stats: 2025-12-04T09:41:44.0658325Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:44.0659339Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0659605Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0659881Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0660552Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0661660Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0662704Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0663746Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.0664813Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0665861Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0666898Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0667995Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0669043Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0670095Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0671039Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.0671648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0672029Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0672334Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0672860Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0674590Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), 
('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0676172Z graph_break [] 2025-12-04T09:41:44.0676411Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0676785Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0677156Z Autotune Choices Stats: 2025-12-04T09:41:44.0678197Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0679214Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0679466Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0679835Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0680518Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0681579Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0682686Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0683744Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0684783Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0685862Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0686903Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0687940Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0688983Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0690018Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0690926Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:44.0691487Z Autotune Choices Stats: 2025-12-04T09:41:44.0692478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0693503Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0693800Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0694058Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0694735Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0695797Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0696858Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0697942Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0698982Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0700008Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0701396Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0702464Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0703594Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0704651Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.0705563Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:44.0706169Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0706542Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0706911Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0707444Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0709170Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0710741Z graph_break [] 2025-12-04T09:41:44.0717443Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0717851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0718243Z Autotune Choices Stats: 2025-12-04T09:41:44.0719245Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0720457Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0720727Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0720995Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0721740Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.0722817Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0723896Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0724961Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0726009Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0727065Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0728173Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0729218Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0730266Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0731363Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0732283Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds 
precompiling for 15 choices 2025-12-04T09:41:44.0732817Z Autotune Choices Stats: 2025-12-04T09:41:44.0733812Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0734886Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0735156Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0735423Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0736111Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0737169Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0738253Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0739296Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0740405Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0741439Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0742518Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0743566Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0744616Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0745668Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0746573Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:44.0747187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0747577Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0747899Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0748380Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0750127Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0751715Z graph_break [] 2025-12-04T09:41:44.0751957Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0752335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0752718Z Autotune Choices Stats: 2025-12-04T09:41:44.0753720Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.0754745Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0755006Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0755282Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0756014Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0757081Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0758167Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0759216Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0760329Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0761431Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0762486Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0763648Z 
triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0764721Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0765773Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0766684Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:44.0767205Z Autotune Choices Stats: 2025-12-04T09:41:44.0768206Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0769244Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0769509Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0769771Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0770457Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0771530Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0772582Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0773663Z triton_mm_2417 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0774704Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0775739Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0776815Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0777908Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0778963Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0780012Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0780924Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:44.0781579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0781959Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0782270Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0782751Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0784092Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0785221Z graph_break [] 2025-12-04T09:41:44.0785460Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0785843Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0786217Z Autotune Choices Stats: 2025-12-04T09:41:44.0787208Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0788245Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0788508Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0788772Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0789455Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0790510Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0791567Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0792624Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0793732Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0794792Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0795844Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0796929Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0798023Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0799062Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0800017Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.0800807Z Autotune Choices Stats: 2025-12-04T09:41:44.0801795Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0802888Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0803152Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0803420Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0804094Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0805210Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0806261Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0806739Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0807223Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0807693Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0808172Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0808649Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0809127Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0809613Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0810010Z SingleProcess AUTOTUNE 
benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:44.0810105Z Autotune Choices Stats: 2025-12-04T09:41:44.0810951Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0811049Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0811140Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0811260Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0811813Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0812294Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0812775Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0813261Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0813735Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0814259Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0814742Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0815263Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0815740Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0816218Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0816557Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:44.0816740Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0816841Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0816975Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0817226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0818226Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0818313Z graph_break [] 2025-12-04T09:41:44.0818420Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0818599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0818704Z Autotune Choices Stats: 2025-12-04T09:41:44.0819546Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.0819688Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0819783Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0819892Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0820383Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0820850Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0821358Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0821843Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0822326Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0822812Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0823283Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0823794Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0824265Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0824778Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0825118Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:44.0825214Z Autotune Choices Stats: 2025-12-04T09:41:44.0826070Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0826171Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0826264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0826378Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0826857Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0827343Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0827844Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0828354Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0828831Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0829344Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0829814Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0830291Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0830818Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0831295Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0831643Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:44.0831743Z Autotune Choices Stats: 2025-12-04T09:41:44.0832576Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0832677Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0832810Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0832924Z dtypes: torch.bfloat16, 
torch.bfloat16 2025-12-04T09:41:44.0833407Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0833881Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0834398Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0834889Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0835381Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0835863Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0836341Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0836832Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0837301Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0837820Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0838161Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:44.0838382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0838477Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0838611Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0838867Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0840333Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0840425Z graph_break [] 2025-12-04T09:41:44.0840534Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0840709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0840803Z Autotune Choices Stats: 2025-12-04T09:41:44.0841638Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.0841732Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0841824Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0841929Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0842410Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0842918Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0843433Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0843914Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0844391Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0844877Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0845346Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0845826Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0846291Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0846762Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:44.0847099Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:44.0847193Z Autotune Choices Stats: 2025-12-04T09:41:44.0848080Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0848250Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0848336Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0848446Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0848921Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0849432Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0849909Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0850385Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0850855Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0851320Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.0851792Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0852306Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0852825Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0853296Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0853630Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:44.0853803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0853899Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0854035Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0854282Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0855228Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0855316Z graph_break [] 2025-12-04T09:41:44.0855420Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0855596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0855688Z Autotune Choices Stats: 2025-12-04T09:41:44.0856513Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0856615Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0856743Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0856852Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0857327Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0857808Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.0858290Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0858801Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0859273Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0859743Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0860207Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0860678Z triton_mm_2646 0.0287 ms 
96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0861190Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0861664Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0862035Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.0862132Z Autotune Choices Stats: 2025-12-04T09:41:44.0862956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:44.0863048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0863144Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0863251Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0863723Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0864197Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0864666Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0865137Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0865618Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0866088Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0866602Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0867079Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0867603Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0868120Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0868454Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:44.0868558Z Autotune Choices Stats: 2025-12-04T09:41:44.0869388Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0869485Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0869571Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0869682Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0870161Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0870671Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0871150Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0871657Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0872135Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0872618Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0873088Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0873570Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0874037Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.0874519Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0874853Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:44.0875036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0875135Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0875306Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0875552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0876934Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0877017Z graph_break [] 2025-12-04T09:41:44.0877128Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0877342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0877440Z Autotune Choices Stats: 2025-12-04T09:41:44.0878343Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0878441Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0878533Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0878636Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.0879120Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0879647Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0880156Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0880638Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0881174Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0881665Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0882143Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0882620Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0883089Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0883557Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0883887Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:44.0883979Z Autotune Choices Stats: 2025-12-04T09:41:44.0884821Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0884957Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0885044Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0885153Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0885638Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0886115Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0886586Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0887092Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0887566Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0888032Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0888502Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0888977Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0889491Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0889966Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0890330Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:44.0890515Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0890610Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0890750Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0890996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0891940Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0892029Z graph_break [] 2025-12-04T09:41:44.0892133Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0892314Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0892407Z 
Autotune Choices Stats: 2025-12-04T09:41:44.0893241Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0893340Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0893427Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0893536Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0894027Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0894547Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0895018Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0895484Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0896000Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0896490Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0896966Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:44.0897442Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0897957Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0898429Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0898799Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:44.0898898Z Autotune Choices Stats: 2025-12-04T09:41:44.0899764Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0899860Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0899959Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0900064Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0900684Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0901175Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0901657Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.0902135Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0902604Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0903081Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0903554Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0904100Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0904583Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0905057Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0905399Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:44.0905556Z Autotune Choices Stats: 2025-12-04T09:41:44.0906388Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 
2025-12-04T09:41:44.0906488Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0906576Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0906698Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0907199Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0907699Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0908227Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0908704Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0909238Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0909725Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0910206Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0910684Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0911159Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0911630Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0911962Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:44.0912147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0912245Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0912382Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0912635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0914025Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0914159Z graph_break [] 2025-12-04T09:41:44.0914265Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0914445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0914540Z Autotune Choices Stats: 2025-12-04T09:41:44.0915424Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0915528Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0915615Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.0915721Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0916215Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0916690Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0917180Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0917737Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0918225Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0918743Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0919227Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0919781Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0920336Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0920898Z triton_mm_2862 
0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0921284Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.0921385Z Autotune Choices Stats: 2025-12-04T09:41:44.0922386Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0922488Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0922586Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0922697Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0923261Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0923865Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0924424Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0924979Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0925574Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0926133Z triton_mm_2892 0.0307 ms 90.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0926687Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0927276Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0927859Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0928456Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0928847Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:44.0929047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0929150Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0929329Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0929615Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0930778Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0930873Z graph_break [] 2025-12-04T09:41:44.0930987Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0931192Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:44.0931290Z Autotune Choices Stats: 2025-12-04T09:41:44.0932292Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.0932388Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0932477Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0932596Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0933166Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0933731Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0934330Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0934890Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0935453Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0936053Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0936617Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0937176Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0937787Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0938342Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0938730Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:44.0938933Z Autotune Choices Stats: 2025-12-04T09:41:44.0939930Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0940034Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0940125Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0940273Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0940841Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0941399Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0941965Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0942522Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0943081Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0943637Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0944187Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0944759Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0945876Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0946365Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0946698Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:44.0946791Z Autotune Choices Stats: 2025-12-04T09:41:44.0947684Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.0947786Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0947884Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0948003Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0948487Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0948968Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0949438Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0949923Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0950437Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0950972Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0951450Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0951924Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0952403Z triton_mm_2959 0.0277 ms 96.7% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0952871Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0953210Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:44.0953387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0953481Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0953618Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0953867Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0954813Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0954901Z graph_break [] 2025-12-04T09:41:44.0955049Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0955230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0955325Z Autotune Choices Stats: 2025-12-04T09:41:44.0956169Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0956264Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0956350Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0956461Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.0957011Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0957499Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0958025Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0958495Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0958966Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0959525Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0960000Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0960504Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0960976Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0961452Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0961788Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:44.0961887Z Autotune Choices Stats: 2025-12-04T09:41:44.0962715Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.0962815Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0962901Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0963008Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0963486Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0963965Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0964454Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0964973Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0965448Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0965917Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0966426Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0966904Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0967376Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0967900Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0968230Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:44.0968322Z Autotune Choices Stats: 2025-12-04T09:41:44.0969204Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:44.0969300Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0969395Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0969506Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.0970025Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0970496Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0970964Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0971437Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0971907Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0972379Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0972851Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0973322Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0973802Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0974315Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0974643Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:44.0974817Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0974914Z 
frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0975052Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0975337Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0976718Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0976806Z graph_break [] 2025-12-04T09:41:44.0976913Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0977092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0977184Z Autotune Choices Stats: 2025-12-04T09:41:44.0978021Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0978156Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0978243Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0978357Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0978833Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0979356Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0979997Z triton_mm_3081 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0980667Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0981329Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0981999Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0982719Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0983356Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0984020Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0984678Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0985238Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:44.0985380Z Autotune Choices Stats: 2025-12-04T09:41:44.0986531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:44.0986669Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0986791Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.0986993Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.0987668Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0988380Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0989085Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0990008Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.0990786Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.0991528Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0992284Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.0993099Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.0993775Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.0994480Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.0994976Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:44.0995250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.0995389Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.0995575Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.0995939Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.0996891Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.0996984Z graph_break [] 2025-12-04T09:41:44.0997108Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.0997312Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.0997422Z Autotune Choices Stats: 2025-12-04T09:41:44.0998256Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.0998460Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.0998548Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.0998656Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.0999143Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.0999756Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1000486Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1001049Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1001535Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1002022Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1002591Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1003065Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1003592Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1004077Z 
triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1004414Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.1004510Z Autotune Choices Stats: 2025-12-04T09:41:44.1005355Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1005453Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1005545Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1005650Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1006126Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1006607Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1007085Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1007567Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1008094Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1008566Z triton_mm_3145 0.0317 ms 87.1% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1009046Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1009585Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1010073Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1010552Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1010886Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:44.1010985Z Autotune Choices Stats: 2025-12-04T09:41:44.1011837Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.1011983Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1012074Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1012194Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1012674Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1019148Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1019655Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1020148Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1020623Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1021097Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1021578Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1022052Z triton_mm_3178 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1022534Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1023011Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1023388Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:44.1023575Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:44.1023674Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1023811Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1024059Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1025499Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1025591Z graph_break [] 2025-12-04T09:41:44.1025698Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1025890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1025985Z Autotune Choices Stats: 2025-12-04T09:41:44.1026839Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1026935Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1027023Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1027181Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1027693Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1028203Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.1028721Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1029202Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1029684Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1030165Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1030650Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1031123Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1031603Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1032088Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1032429Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:44.1032573Z Autotune Choices Stats: 2025-12-04T09:41:44.1033430Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1033535Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1033625Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1033731Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1034219Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1034739Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1035227Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1035699Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1036168Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1036642Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1037152Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1037634Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1038174Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1038656Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1038986Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:44.1039168Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1039269Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1039402Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1039727Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1041125Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1041211Z graph_break [] 2025-12-04T09:41:44.1041323Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1041498Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1041604Z Autotune Choices Stats: 2025-12-04T09:41:44.1042443Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1042583Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1042681Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1042788Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1043267Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1043755Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1044273Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1044766Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1045249Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1045733Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1046204Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1046712Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.1047194Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1047752Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1048092Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:44.1048185Z Autotune Choices Stats: 2025-12-04T09:41:44.1049039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1049141Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1049234Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1049345Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1049831Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1050312Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1050782Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1051256Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1051731Z 
triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1052243Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1052715Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1053191Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1053711Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1054189Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1054522Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:44.1054702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1054797Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1054929Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1055183Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1056568Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1056700Z graph_break [] 2025-12-04T09:41:44.1056805Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1057025Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1057121Z Autotune Choices Stats: 2025-12-04T09:41:44.1058026Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.1058131Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1058218Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1058326Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1058824Z triton_mm_3300 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1059307Z triton_mm_3288 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1059785Z triton_mm_3289 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1060263Z triton_mm_3294 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1060749Z triton_mm_3295 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1061222Z triton_mm_3296 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1061737Z triton_mm_3287 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1062209Z triton_mm_3290 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1062678Z triton_mm_3291 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1063193Z triton_mm_3292 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1063529Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6016 seconds precompiling for 15 choices 2025-12-04T09:41:44.1063629Z Autotune Choices Stats: 2025-12-04T09:41:44.1064474Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028575999662280083, "best_triton_pos": 0} 2025-12-04T09:41:44.1064570Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1064663Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1064769Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1065252Z triton_mm_3318 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1065768Z triton_mm_3319 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1066281Z triton_mm_3320 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1066757Z triton_mm_3323 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1067227Z triton_mm_3317 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1067706Z triton_mm_3322 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1068177Z triton_mm_3321 0.0337 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1068660Z triton_mm_3324 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1069150Z triton_mm_3326 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1069625Z triton_mm_3327 0.0348 ms 82.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1069966Z SingleProcess AUTOTUNE benchmarking 
takes 0.1794 seconds and 1.8359 seconds precompiling for 13 choices 2025-12-04T09:41:44.1070142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1070285Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1070419Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1070666Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1072055Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1072141Z graph_break [] 2025-12-04T09:41:44.1072297Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1072474Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1072568Z Autotune Choices Stats: 2025-12-04T09:41:44.1073418Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3344", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.1073512Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1073600Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1073712Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1074201Z triton_mm_3344 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1074675Z triton_mm_3332 0.0276 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1075214Z triton_mm_3333 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1075732Z triton_mm_3339 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1076207Z triton_mm_3341 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1076683Z triton_mm_3331 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1077194Z triton_mm_3338 0.0285 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1077685Z triton_mm_3330 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1078165Z triton_mm_3334 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1078632Z triton_mm_3335 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1078968Z SingleProcess AUTOTUNE benchmarking takes 0.2046 seconds and 0.6449 seconds precompiling for 15 choices 2025-12-04T09:41:44.1079062Z Autotune 
Choices Stats: 2025-12-04T09:41:44.1079947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.1080089Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1080178Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1080284Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1080763Z triton_mm_3366 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1081227Z triton_mm_3360 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1081744Z triton_mm_3361 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1082215Z triton_mm_3362 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1082692Z triton_mm_3363 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1083159Z triton_mm_3365 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1083631Z triton_mm_3367 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1084104Z triton_mm_3364 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1084616Z triton_mm_3369 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1085132Z triton_mm_3370 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1085462Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.9040 seconds precompiling for 13 choices 2025-12-04T09:41:44.1085694Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:44.1085800Z Traceback (most recent call last): 2025-12-04T09:41:44.1086217Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.1086412Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:44.1086759Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:44.1086942Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:44.1087119Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.1087222Z Searched string: 2025-12-04T09:41:44.1087386Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.1087392Z 2025-12-04T09:41:44.1087509Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.1087514Z 2025-12-04T09:41:44.1087643Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.1087774Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.1087779Z 2025-12-04T09:41:44.1087874Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:44.1087965Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.1088068Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1088159Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.1088163Z 2025-12-04T09:41:44.1088255Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.1088347Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.1088487Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1088581Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.1088586Z 2025-12-04T09:41:44.1088590Z 2025-12-04T09:41:44.1088747Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.1088752Z 2025-12-04T09:41:44.1088756Z 2025-12-04T09:41:44.1088884Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.1089001Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.1089114Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.1089205Z idx_m = rm[:, None] 2025-12-04T09:41:44.1089292Z idx_n = rn[None, :] 2025-12-04T09:41:44.1089386Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.1089393Z 2025-12-04T09:41:44.1089539Z # inductor generates a suffix 2025-12-04T09:41:44.1089635Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1089855Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.1089952Z ''', device_str='cuda') 2025-12-04T09:41:44.1089956Z 2025-12-04T09:41:44.1089960Z 2025-12-04T09:41:44.1090060Z async_compile.wait(globals()) 2025-12-04T09:41:44.1090149Z del async_compile 2025-12-04T09:41:44.1090153Z 2025-12-04T09:41:44.1090239Z class Runner: 2025-12-04T09:41:44.1090342Z def __init__(self, partitions): 2025-12-04T09:41:44.1090452Z self.partitions = partitions 2025-12-04T09:41:44.1090456Z 2025-12-04T09:41:44.1090567Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.1090664Z new_callables = [] 2025-12-04T09:41:44.1090782Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.1090888Z new_callables.append(fn(c)) 2025-12-04T09:41:44.1091045Z self.partitions = new_callables 
2025-12-04T09:41:44.1091052Z 2025-12-04T09:41:44.1091141Z def call(self, args): 2025-12-04T09:41:44.1091231Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.1091324Z args.clear() 2025-12-04T09:41:44.1091456Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.1091583Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.1091694Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.1091791Z torch.cuda.set_device(0) 2025-12-04T09:41:44.1092002Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.1092223Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.1092323Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.1092518Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.1092603Z del arg0_1 2025-12-04T09:41:44.1092770Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.1093033Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.1093136Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.1093373Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.1093458Z del arg1_1 2025-12-04T09:41:44.1093536Z del buf0 2025-12-04T09:41:44.1093625Z return (buf1, ) 2025-12-04T09:41:44.1093632Z 2025-12-04T09:41:44.1093733Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.1093821Z call = runner.call 2025-12-04T09:41:44.1093985Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.1093990Z 2025-12-04T09:41:44.1093993Z 2025-12-04T09:41:44.1094133Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.1094268Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.1094416Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.1094623Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.1094827Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.1094974Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.1095138Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.1095142Z 2025-12-04T09:41:44.1095154Z 2025-12-04T09:41:44.1095246Z if __name__ == "__main__": 2025-12-04T09:41:44.1095450Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.1095616Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.1095701Z From CHECK: .to( 2025-12-04T09:41:44.1095706Z 2025-12-04T09:41:44.1095709Z 2025-12-04T09:41:44.1095883Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.1096537Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.1096544Z 2025-12-04T09:41:44.1096767Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.1096953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1097049Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1097205Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1097486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1098872Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1099002Z graph_break [] 2025-12-04T09:41:44.1099108Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1099283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1099383Z Autotune Choices Stats: 2025-12-04T09:41:44.1100454Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1100612Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1100737Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1100884Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1101562Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1102042Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1102512Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1102974Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1103430Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1103905Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1104365Z 
triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1104928Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1105384Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1105854Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1106190Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.1106371Z Autotune Choices Stats: 2025-12-04T09:41:44.1107245Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1107371Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1107467Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1107575Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1108053Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1108517Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1109031Z triton_mm_45 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1109492Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1110005Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1110462Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1110923Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1111395Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1111864Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1112332Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1112663Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:44.1112841Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1112938Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1113075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:44.1113329Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1114277Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1114406Z graph_break [] 2025-12-04T09:41:44.1114513Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1114693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1114786Z Autotune Choices Stats: 2025-12-04T09:41:44.1115615Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1115755Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1115850Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1115961Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1116434Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1116909Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1117385Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1117897Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1118434Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1118919Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1119426Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1119986Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1120462Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1120941Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1121274Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:44.1121370Z Autotune Choices Stats: 2025-12-04T09:41:44.1122203Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1122299Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1122391Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.1122495Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1122975Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1123440Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1123958Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1124424Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1124885Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1125398Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1125862Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1126340Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1126813Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1127280Z triton_mm_833 0.0338 ms 
81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1127655Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:44.1127749Z Autotune Choices Stats: 2025-12-04T09:41:44.1128591Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.1128730Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1128820Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1128936Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1129410Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1129883Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1130346Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1130821Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1131304Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1131777Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1132247Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1132717Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1133233Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1133693Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1134029Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:44.1134211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1134348Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1134484Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1134731Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1135677Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1135768Z graph_break [] 2025-12-04T09:41:44.1135872Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1136053Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:44.1136145Z Autotune Choices Stats: 2025-12-04T09:41:44.1136975Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.1137115Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1137217Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1137334Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1137882Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1138350Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1138821Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1139290Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1139758Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1140235Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1140703Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1141179Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1141644Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1142112Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1142512Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:44.1142605Z Autotune Choices Stats: 2025-12-04T09:41:44.1143449Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:44.1143546Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1143644Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1143788Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1144262Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1144738Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1145203Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1145672Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1146135Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1146647Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1147149Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1147618Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1148146Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1148618Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1148954Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:44.1149050Z Autotune Choices Stats: 2025-12-04T09:41:44.1149883Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1149983Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1150070Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1150185Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1150658Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1151133Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1151658Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1152126Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1152595Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1153095Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1153566Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1154030Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1154492Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1154965Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1155295Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:44.1155517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1155613Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1155746Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1155997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1157406Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1157497Z graph_break [] 2025-12-04T09:41:44.1157601Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1157781Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1157878Z Autotune Choices Stats: 2025-12-04T09:41:44.1158715Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1158817Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.1158905Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1159010Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1159552Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1160021Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1160493Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1160998Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1161463Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1161935Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1162448Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1162925Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1163392Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.1163858Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1164186Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:44.1164279Z Autotune Choices Stats: 2025-12-04T09:41:44.1165115Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1165251Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1165344Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1165447Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1165956Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1166431Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1166906Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1167378Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1167843Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1168304Z 
triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1168775Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1169249Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1169732Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1170244Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1170587Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:44.1170760Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1170859Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1170997Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1171242Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1172217Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1172306Z graph_break [] 2025-12-04T09:41:44.1172411Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1172591Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1172684Z Autotune Choices Stats: 2025-12-04T09:41:44.1173523Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.1173620Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1173707Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1173866Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1174349Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1174822Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1175364Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1175834Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1176307Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1176774Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1177273Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1177771Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1178244Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1178724Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1179055Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:44.1179212Z Autotune Choices Stats: 2025-12-04T09:41:44.1180045Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1180144Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1180231Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1180335Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1180812Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1181326Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1181802Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1182274Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1182742Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1183213Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1183720Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1184201Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1184711Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1185188Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1185519Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:44.1185619Z Autotune Choices Stats: 2025-12-04T09:41:44.1186458Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1186555Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1186647Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1186759Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1187243Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1187737Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1188247Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1188717Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1189234Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1189702Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1190176Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1190685Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1191164Z 
triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1191639Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1191973Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:44.1192149Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1192244Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1192380Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1192669Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1193609Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1193702Z graph_break [] 2025-12-04T09:41:44.1193844Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1194026Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1194119Z Autotune Choices Stats: 2025-12-04T09:41:44.1194942Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1195050Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1195138Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1195248Z dtypes: 
torch.float16, torch.float16 2025-12-04T09:41:44.1195724Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1196198Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1196674Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1197154Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1197640Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1198165Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1198644Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1199123Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1199675Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1200158Z triton_mm_1095 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1200849Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.1200951Z Autotune Choices Stats: 2025-12-04T09:41:44.1201787Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1201882Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1201974Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1202082Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1202652Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1203129Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1203663Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1204141Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1204615Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1205090Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1205560Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1206039Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1206513Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1206988Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1207334Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:44.1207429Z Autotune Choices Stats: 2025-12-04T09:41:44.1208376Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1208472Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1208558Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1208673Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1209147Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1209681Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1210159Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1210643Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1211124Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1211601Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1212158Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1212629Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1213148Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1213628Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1214043Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:44.1214324Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:44.1214475Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1214682Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1215040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1217550Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1217679Z graph_break [] 2025-12-04T09:41:44.1217832Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1218093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1218229Z Autotune Choices Stats: 2025-12-04T09:41:44.1219465Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1219721Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1219861Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1220018Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1220722Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1221474Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.1222329Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1223120Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1223924Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1224708Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1225512Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1226306Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1227040Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1227805Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1228261Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:44.1228366Z Autotune Choices Stats: 2025-12-04T09:41:44.1229217Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1229325Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1229413Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1229524Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1230174Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1230802Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1231277Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1231745Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1232375Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1233036Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1233504Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1234072Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1234791Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1235284Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1235622Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:44.1235807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1235920Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1236054Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1236304Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1238021Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1238169Z graph_break [] 2025-12-04T09:41:44.1238276Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1238492Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1238599Z Autotune Choices Stats: 2025-12-04T09:41:44.1239834Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1239942Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1240047Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1240154Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1240654Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1241237Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1241886Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1242367Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1242835Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1243305Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1243967Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1244592Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.1245074Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1245604Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1245985Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:44.1246124Z Autotune Choices Stats: 2025-12-04T09:41:44.1247178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1247277Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1247367Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1247487Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1248011Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1248668Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1249304Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1249826Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1250304Z 
triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1250882Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1251527Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1252008Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1252489Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1253140Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1253576Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:44.1253766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1253861Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1254109Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1254358Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1256043Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1256139Z graph_break [] 2025-12-04T09:41:44.1256256Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1256610Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1256745Z Autotune Choices Stats: 2025-12-04T09:41:44.1257908Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.1258062Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1258186Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1258342Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1258952Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1259576Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1260328Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1260911Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1261452Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1261944Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1262438Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1262911Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1263385Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1263864Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1264201Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:44.1264300Z Autotune Choices Stats: 2025-12-04T09:41:44.1265139Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1265281Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1265374Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1265479Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1265966Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1266450Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1266960Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1267443Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1267963Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1268436Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1268911Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1269411Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1269925Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1270398Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1270778Z SingleProcess AUTOTUNE benchmarking 
takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:44.1270956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1271058Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1271190Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1271437Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1272827Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1272914Z graph_break [] 2025-12-04T09:41:44.1273028Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1273204Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1273296Z Autotune Choices Stats: 2025-12-04T09:41:44.1274134Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1274231Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1274326Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1274431Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1274952Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1275433Z triton_mm_1311 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1275910Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1276431Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1276912Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1277391Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1277929Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1278401Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1278882Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1279389Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1279805Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:44.1279899Z 
Autotune Choices Stats: 2025-12-04T09:41:44.1280785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.1283038Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1283126Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1283232Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1283733Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1284213Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1284690Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1285162Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1285738Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1286306Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1286835Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.1287320Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1287794Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1294361Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1294725Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:44.1294911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1295021Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1295160Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1295422Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1296818Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1296913Z graph_break [] 2025-12-04T09:41:44.1297027Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1297205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1297310Z Autotune Choices Stats: 2025-12-04T09:41:44.1298266Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": 
"triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1298364Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1298461Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1298573Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1299070Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1299680Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1300157Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1300937Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1301408Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1301888Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1302363Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1302929Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1303412Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1303888Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1304235Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:44.1304389Z Autotune Choices Stats: 2025-12-04T09:41:44.1305231Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1305334Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1305423Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1305538Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1306018Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1306500Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1306979Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1307448Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1308042Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1308515Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1308987Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1309532Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1310016Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1310493Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1310830Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:44.1311015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1311117Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1311257Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1311508Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1312452Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1312586Z graph_break [] 2025-12-04T09:41:44.1312696Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1312873Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1312974Z Autotune Choices Stats: 2025-12-04T09:41:44.1313855Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.1313961Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1314050Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1314163Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1314656Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1315130Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1315610Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1316086Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1316571Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1317048Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1317572Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1318066Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1318590Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1319068Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1319403Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:44.1319598Z Autotune Choices Stats: 2025-12-04T09:41:44.1320448Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1320547Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1320642Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1320750Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1321235Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.1321727Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1322250Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1322728Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1323200Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1323718Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1324193Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1324670Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1325153Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1325633Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1325973Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds 
and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:44.1326069Z Autotune Choices Stats: 2025-12-04T09:41:44.1326957Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1327063Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1327151Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1327270Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1327804Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1328328Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1328810Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1329299Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1329785Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1330267Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1330760Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1331241Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1331776Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1332377Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1332717Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:44.1332956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1333060Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1333201Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1333458Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1334408Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1334500Z graph_break [] 2025-12-04T09:41:44.1334608Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1334786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1334890Z Autotune Choices Stats: 2025-12-04T09:41:44.1335730Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1335834Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1335925Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1336035Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1336591Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1337080Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1337590Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1338146Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1338622Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1339111Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1339580Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1340059Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1340529Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1341044Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1341381Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.1341476Z Autotune Choices Stats: 2025-12-04T09:41:44.1342312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1342450Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1342549Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1342658Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1343134Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1343623Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1344091Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1344562Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:44.1345037Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1345512Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1346020Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1346497Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1346978Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1347524Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1347887Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:44.1347982Z Autotune Choices Stats: 2025-12-04T09:41:44.1348809Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1348916Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1349006Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1349130Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:44.1349606Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1350081Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1350603Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1351096Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1351580Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1352098Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1352590Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1353078Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1353547Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1354021Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1354360Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:44.1354543Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1354643Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1354778Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1355031Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1356011Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1356107Z graph_break [] 2025-12-04T09:41:44.1356257Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1356436Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1356539Z Autotune Choices Stats: 2025-12-04T09:41:44.1357428Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1357538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1357629Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1357742Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1358233Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1358707Z triton_mm_1569 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1359201Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1359827Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1360318Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1360801Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1361316Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1361799Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1362272Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1362746Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1363081Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 
2025-12-04T09:41:44.1363175Z Autotune Choices Stats: 2025-12-04T09:41:44.1364017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.1364115Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1364212Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1364329Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1364845Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1365331Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1365800Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1366321Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1366792Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1367264Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1367785Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1368261Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1368742Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1369257Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1369598Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:44.1369694Z Autotune Choices Stats: 2025-12-04T09:41:44.1370530Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1370635Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1370725Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1370919Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1371401Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1371875Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1372350Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.1372821Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1373309Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1373774Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1374291Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1374773Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1375247Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1375776Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1376111Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:44.1376297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1376396Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1376535Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1376798Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:44.1378183Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1378278Z graph_break [] 2025-12-04T09:41:44.1378389Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1378606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1378704Z Autotune Choices Stats: 2025-12-04T09:41:44.1379543Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1379646Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1379737Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1379848Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1380371Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1380848Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1381335Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1381807Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1382276Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1382754Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1383230Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1383750Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1384224Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1384704Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1385087Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:44.1385187Z Autotune Choices Stats: 2025-12-04T09:41:44.1386048Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1386151Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1386249Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1386357Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1386840Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1387328Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1387809Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1388364Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1388846Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1389315Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1389834Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1390313Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1390810Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1391284Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1391623Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:44.1391804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1391902Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1392047Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1392299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1393719Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1393811Z graph_break [] 2025-12-04T09:41:44.1393918Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1394106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1394247Z Autotune Choices Stats: 2025-12-04T09:41:44.1395095Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.1395194Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1395286Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1395401Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.1395890Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1396363Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1396842Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1397311Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1397836Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1398357Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1398834Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1399351Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1399901Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1400609Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1400953Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:44.1401057Z Autotune Choices Stats: 2025-12-04T09:41:44.1401890Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1401995Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1402086Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1402195Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1402690Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1403243Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1403727Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1404271Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1404742Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1405230Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1405701Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1406186Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1406672Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1407159Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1407578Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.1407759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1407861Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1407996Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1408246Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1409706Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1409801Z graph_break [] 2025-12-04T09:41:44.1409914Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:44.1410090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1410188Z Autotune Choices Stats: 2025-12-04T09:41:44.1411052Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1411151Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1411247Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1411357Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1411855Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1412341Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1412915Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1413402Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1413944Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1414432Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1414913Z triton_mm_1740 0.0278 
ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1415385Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1415861Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1416337Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1416681Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:44.1416817Z Autotune Choices Stats: 2025-12-04T09:41:44.1417690Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1417804Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1417909Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1418035Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1418517Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1419032Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1419510Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1419987Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1420465Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1420935Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1421417Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1421898Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1422415Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1422900Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1423277Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:44.1423463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1423563Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1423700Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T09:41:44.1423959Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1425331Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1425426Z graph_break [] 2025-12-04T09:41:44.1425538Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1425715Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1425819Z Autotune Choices Stats: 2025-12-04T09:41:44.1426657Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1426805Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1426897Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1427007Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1427495Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1428019Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1428555Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:44.1429043Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1429529Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1430019Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1430503Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1430984Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1431465Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1431984Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1432320Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.1432417Z Autotune Choices Stats: 2025-12-04T09:41:44.1433324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, 
"best_triton_pos": 0} 2025-12-04T09:41:44.1433422Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1433515Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1433632Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1434115Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1434597Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1435079Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1435559Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1436039Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1436555Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1437033Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1437510Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1438041Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1438522Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1438859Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:44.1439040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1439140Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1439280Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1439604Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1440553Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1440646Z graph_break [] 2025-12-04T09:41:44.1440754Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1440931Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1441033Z Autotune Choices Stats: 2025-12-04T09:41:44.1441924Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.1442098Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1442188Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1442301Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1442800Z triton_mm_1838 0.0268 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1443278Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1443760Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1444234Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1444715Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1445198Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1445709Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1446192Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1446662Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1447179Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1447525Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:44.1447640Z Autotune Choices Stats: 2025-12-04T09:41:44.1448506Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1448604Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1448700Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1448808Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1449288Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1449780Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1450251Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1450770Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1451242Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1451716Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.1452240Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1452718Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1453200Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1453677Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1454015Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:44.1454109Z Autotune Choices Stats: 2025-12-04T09:41:44.1454946Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1455083Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1455173Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1455294Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1455771Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1456246Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.1456767Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1457245Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1457750Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1458228Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1458715Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1459208Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1459684Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1460203Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1460537Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:44.1460720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1460856Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:41:44.1460994Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1461250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1462628Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1462723Z graph_break [] 2025-12-04T09:41:44.1462833Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1463010Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1463111Z Autotune Choices Stats: 2025-12-04T09:41:44.1463962Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.1464071Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1464204Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1464315Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1464813Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1465292Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1465776Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1466290Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1466780Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1467254Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1467733Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1468210Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1468694Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1469180Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1469554Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:44.1469654Z Autotune Choices Stats: 2025-12-04T09:41:44.1470496Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1470633Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1470734Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1470844Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1471328Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1471822Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1472305Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1472778Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1473249Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1473727Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1474242Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1474718Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1475203Z 
triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1475717Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1476066Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:44.1476243Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1476342Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1476486Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1476736Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1478160Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1478248Z graph_break [] 2025-12-04T09:41:44.1478356Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1478548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1478642Z Autotune Choices Stats: 2025-12-04T09:41:44.1479650Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 
2025-12-04T09:41:44.1479752Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1479846Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1479999Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1480490Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1480981Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1481461Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1481941Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1482429Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1482911Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1483393Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1483910Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1484387Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1484857Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1485237Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:44.1485341Z Autotune Choices Stats: 2025-12-04T09:41:44.1486188Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1486299Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1486392Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1486498Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1486981Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1487460Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1487995Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1488470Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1488991Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1489465Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1489976Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1490459Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1490940Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1491424Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1491760Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:44.1491940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1492042Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1492179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1492436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1493860Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1493955Z graph_break [] 2025-12-04T09:41:44.1494065Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1494244Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1494348Z Autotune Choices Stats: 2025-12-04T09:41:44.1495233Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1495336Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1495432Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1495543Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1496038Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1496520Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1497006Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1497493Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1497979Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1498500Z triton_mm_1999 0.0277 ms 99.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1498976Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1499504Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1499982Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1500740Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1501089Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:44.1501188Z Autotune Choices Stats: 2025-12-04T09:41:44.1502040Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1502141Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1502232Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1502347Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1502914Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1503407Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1503882Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1504414Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1504891Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1505369Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1505854Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1506335Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1506826Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1507308Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1507664Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:44.1507933Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:44.1508033Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1508179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1508431Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1509821Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1509971Z graph_break [] 2025-12-04T09:41:44.1510079Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1510265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1510360Z Autotune Choices Stats: 2025-12-04T09:41:44.1511204Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1511312Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1511406Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1511529Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1512013Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1512533Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:44.1513019Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1513499Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1514043Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1514523Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1515013Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1515500Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1515982Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1516479Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1516816Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:44.1516918Z Autotune Choices Stats: 2025-12-04T09:41:44.1517834Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1517954Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1518051Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1518159Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1518650Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1519172Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1519703Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1520182Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1520653Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1521130Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1521601Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1522132Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1522612Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1523086Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1523430Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:44.1523647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1523753Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1523890Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1524142Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1525098Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1525183Z graph_break [] 2025-12-04T09:41:44.1525297Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1525474Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1525574Z Autotune Choices Stats: 2025-12-04T09:41:44.1526424Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1526526Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1526624Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1526736Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1527258Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1527742Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1528261Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1528746Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1529233Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1529724Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1530202Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1530675Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1531154Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1531662Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1532003Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:44.1532099Z Autotune Choices Stats: 2025-12-04T09:41:44.1532947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1533090Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1533185Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1533294Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1533792Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1534267Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1534743Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1535220Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1535699Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8 2025-12-04T09:41:44.1536171Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1536683Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1537171Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1537705Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1538216Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1538550Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:44.1538652Z Autotune Choices Stats: 2025-12-04T09:41:44.1539483Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1539579Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1539678Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1539793Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1540278Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.1540762Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1541282Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1541759Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1542241Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1542771Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1543256Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1543744Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1544232Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1544710Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1545051Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 
seconds precompiling for 15 choices 2025-12-04T09:41:44.1545225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1545333Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1545466Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1545757Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1547144Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1547271Z graph_break [] 2025-12-04T09:41:44.1547383Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1547560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1547657Z Autotune Choices Stats: 2025-12-04T09:41:44.1548563Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1548661Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1548754Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1548860Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1549340Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1549828Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1550301Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1550853Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1551329Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1551802Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1552334Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1552807Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1553290Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1553759Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1554101Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:44.1554195Z Autotune Choices Stats: 
2025-12-04T09:41:44.1555033Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1555138Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1555226Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1555373Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1555863Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1556338Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1556857Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1557331Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1557806Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1558271Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1558738Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.1559222Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1559752Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1560277Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1560612Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:44.1560795Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1560893Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1561030Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1561375Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1562325Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1562418Z graph_break [] 2025-12-04T09:41:44.1562525Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1562702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1562800Z Autotune Choices Stats: 2025-12-04T09:41:44.1563637Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1563740Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1563834Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1563944Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1564424Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1564947Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1565426Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1565910Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1566433Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1566920Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1567410Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1567921Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1568423Z triton_mm_2225 
0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1568895Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1569278Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:44.1569375Z Autotune Choices Stats: 2025-12-04T09:41:44.1570216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1570313Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1570405Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1570514Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1571031Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1571504Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1571979Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1572454Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1572929Z triton_mm_2248 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1573403Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1573880Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1574398Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1574879Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1575353Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1575728Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.1575833Z Autotune Choices Stats: 2025-12-04T09:41:44.1582064Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:44.1582189Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1582282Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1582404Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1582905Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1583392Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1583880Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1584421Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1584894Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1585376Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1585892Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1586375Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1586859Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1587335Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.1587669Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.1587855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1587954Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1588093Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1588344Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1589780Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1589873Z graph_break [] 2025-12-04T09:41:44.1589980Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1590226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1590322Z Autotune Choices Stats: 2025-12-04T09:41:44.1591164Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1591267Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1591357Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1591472Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1591957Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.1592445Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1592927Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1593404Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1593923Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1594393Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1594871Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1595389Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1595866Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1596344Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1596684Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds 
precompiling for 15 choices 2025-12-04T09:41:44.1596781Z Autotune Choices Stats: 2025-12-04T09:41:44.1597656Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1597778Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1597870Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1597978Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1598464Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1598977Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1599459Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1600036Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1600685Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1601163Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1601634Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1602115Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1602601Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1603082Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1603503Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:44.1603681Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1603784Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1603918Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1604170Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1605607Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1605697Z graph_break [] 2025-12-04T09:41:44.1605813Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1605992Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1606090Z Autotune Choices Stats: 2025-12-04T09:41:44.1606922Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1607020Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1607111Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1607229Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1607739Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1608243Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1608775Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1609249Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1609787Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1610255Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1610736Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1611205Z 
triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1611683Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1612165Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1612510Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:44.1612648Z Autotune Choices Stats: 2025-12-04T09:41:44.1613498Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1613601Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1613693Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1613803Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1614319Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1614789Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1615268Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1615736Z triton_mm_2374 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1616205Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1616675Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1617146Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1617626Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1618134Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1618610Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1618984Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:44.1619167Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1619303Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1619494Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1619861Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1621461Z inductor [('triton_bundler_save_kernel', 168), 
('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1621553Z graph_break [] 2025-12-04T09:41:44.1621659Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1621837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1621935Z Autotune Choices Stats: 2025-12-04T09:41:44.1622774Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.1622934Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1623020Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1623126Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1623608Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1624121Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1624594Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1625072Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1625550Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1626024Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1626508Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1626991Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1627468Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1628049Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1628381Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:44.1628514Z Autotune Choices Stats: 2025-12-04T09:41:44.1629367Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1629464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1629556Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1629660Z dtypes: 
torch.float32, torch.float32 2025-12-04T09:41:44.1630139Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1630615Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1631080Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1631556Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1632067Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1632535Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1633007Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1633519Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1633999Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1634473Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1634811Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:44.1634987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1635082Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1635217Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1635468Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1636420Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1636507Z graph_break [] 2025-12-04T09:41:44.1636611Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1636789Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1636922Z Autotune Choices Stats: 2025-12-04T09:41:44.1637805Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1637941Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1638028Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1638138Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1638615Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1639092Z 
triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1639642Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1640115Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1640596Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1641067Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1641588Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1642057Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1642522Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1643030Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1643362Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 
2025-12-04T09:41:44.1643465Z Autotune Choices Stats: 2025-12-04T09:41:44.1644291Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1644384Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1644477Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1644582Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1645057Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1645533Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1646000Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1646519Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1646986Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1647546Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1648010Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1648491Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1648963Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1649431Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1649769Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:44.1649862Z Autotune Choices Stats: 2025-12-04T09:41:44.1650694Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1650830Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1650916Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1651035Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1651508Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1651979Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1652487Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:44.1652966Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1653440Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1653905Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1654379Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1654853Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1655332Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1655845Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1656182Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:44.1656357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1656493Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1656635Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1656881Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:41:44.1657860Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1657956Z graph_break [] 2025-12-04T09:41:44.1658063Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1658242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1658339Z Autotune Choices Stats: 2025-12-04T09:41:44.1659183Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.1659287Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1659373Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1659486Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1660007Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1660474Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1660947Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1661484Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1661967Z 
triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1662450Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1662922Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1663392Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1663861Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1664339Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1664674Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:44.1664769Z Autotune Choices Stats: 2025-12-04T09:41:44.1665639Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1665734Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1665863Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1665968Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1666450Z triton_mm_2543 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1666924Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1667398Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1667900Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1668391Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1668865Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1669368Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1669848Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1670317Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1670828Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1671163Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:44.1671254Z Autotune Choices Stats: 2025-12-04T09:41:44.1672099Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1672192Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1672282Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1672395Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1672874Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1673370Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1673843Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1674362Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1674853Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1675325Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:44.1675848Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1676324Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1676808Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1677278Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1677604Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:44.1677787Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1677880Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1678015Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1678263Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1679720Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1679811Z graph_break [] 2025-12-04T09:41:44.1679920Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1680101Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1680234Z Autotune Choices Stats: 2025-12-04T09:41:44.1681058Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.1681161Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1681249Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1681364Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1681837Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1682304Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1682786Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1683260Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1683782Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1684263Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1684732Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1685239Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1685707Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1686187Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1686518Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:44.1686613Z Autotune Choices Stats: 2025-12-04T09:41:44.1687444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1687540Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1687629Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1687797Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1688303Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1688773Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1689243Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1689754Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1690225Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1690697Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1691170Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1691645Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1692120Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1692590Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1692926Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:44.1693142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1693243Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1693374Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1693617Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1694600Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1694682Z graph_break [] 2025-12-04T09:41:44.1694794Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1694968Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1695058Z Autotune Choices Stats: 2025-12-04T09:41:44.1695896Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1695991Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1696080Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1696188Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1696666Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1697155Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.1697704Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1698230Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1698698Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1699205Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1699676Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1700144Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1700784Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1701259Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1701599Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.1701692Z Autotune Choices Stats: 2025-12-04T09:41:44.1702599Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:44.1702705Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1702793Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1702897Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:44.1703375Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1703899Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1704370Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1704850Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1705322Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1705789Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1706261Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1706743Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1707271Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1707748Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1708077Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:44.1708178Z Autotune Choices Stats: 2025-12-04T09:41:44.1709352Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1709451Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1709544Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1709656Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1710139Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1710619Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1711093Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1711568Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1712043Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1712570Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1713037Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1713550Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1714032Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1714501Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1714834Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:44.1715011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1715112Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1715246Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1715493Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1716875Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1717052Z graph_break [] 2025-12-04T09:41:44.1717160Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1717339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1717430Z Autotune Choices Stats: 2025-12-04T09:41:44.1718390Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1718490Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1718584Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1718695Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1719179Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1719702Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1720178Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1720658Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1721142Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1721622Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1722143Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1722617Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1723129Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1723601Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1723935Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:44.1724038Z Autotune Choices Stats: 2025-12-04T09:41:44.1724881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1724977Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1725068Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1725172Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1725660Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1726174Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1726654Z 
triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1727127Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1727637Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1728157Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1728625Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1729106Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1729588Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1730073Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1730408Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:44.1730584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1730683Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1730814Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1731102Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1732058Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1732206Z graph_break [] 2025-12-04T09:41:44.1732315Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1732492Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1732585Z Autotune Choices Stats: 2025-12-04T09:41:44.1733450Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1733550Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1733636Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1733747Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1734226Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1734715Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1735183Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1735695Z triton_mm_2776 0.0285 ms 97.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1736167Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1736642Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1737154Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1737629Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1738108Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1738576Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1738916Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:44.1739011Z Autotune Choices Stats: 2025-12-04T09:41:44.1739844Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1739945Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.1740031Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1740138Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1740656Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1741131Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1741656Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1742123Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1742596Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1743065Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1743533Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1744018Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1744492Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.1745013Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1745350Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:44.1745451Z Autotune Choices Stats: 2025-12-04T09:41:44.1746290Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.1746423Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1746515Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1746626Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1747107Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1747584Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1748102Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1748587Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1749064Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1749549Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1750063Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1750535Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1751050Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1751520Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1751858Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:44.1752031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1752133Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1752267Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1752513Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1753892Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1754017Z graph_break [] 2025-12-04T09:41:44.1754135Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1754311Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1754403Z Autotune Choices Stats: 2025-12-04T09:41:44.1755261Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1755361Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1755444Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1755555Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1756075Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1756559Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1757036Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1757523Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1758049Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1758527Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1759015Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1759585Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1760061Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1760575Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1760914Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.1761009Z Autotune Choices Stats: 2025-12-04T09:41:44.1761855Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1761961Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1762050Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1762154Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1762646Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1763122Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1763593Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1764103Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1764580Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1765048Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1765559Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1766040Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1766514Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1766999Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1767329Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:44.1767509Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:41:44.1767604Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1767735Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1767989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1768995Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1769083Z graph_break [] 2025-12-04T09:41:44.1769191Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1769368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1769504Z Autotune Choices Stats: 2025-12-04T09:41:44.1770347Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.1770443Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1770536Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1770645Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1771134Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1771602Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1772083Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:44.1772564Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1773078Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1773560Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1774028Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1774549Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1775020Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1775497Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1775837Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:44.1775929Z Autotune Choices Stats: 2025-12-04T09:41:44.1776773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:44.1776872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1776958Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1777069Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1777547Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1778120Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1778595Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1779064Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1779576Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1780047Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1780523Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1780998Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1781475Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1781951Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1782319Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:44.1782417Z Autotune Choices Stats: 2025-12-04T09:41:44.1783276Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.1783379Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1783471Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1783582Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1784113Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1784587Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1785062Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1785534Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1786015Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1786496Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1786973Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1787491Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1787992Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1788485Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1788856Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:44.1789029Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1789134Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1789265Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1789517Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1790458Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1790544Z graph_break [] 
2025-12-04T09:41:44.1790655Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1790832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1790926Z Autotune Choices Stats: 2025-12-04T09:41:44.1791766Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1791901Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1791994Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1792099Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1792580Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1793063Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1793577Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1794059Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1794530Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1794998Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:44.1795472Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1795938Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1796411Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1796929Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1797269Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:44.1797361Z Autotune Choices Stats: 2025-12-04T09:41:44.1798244Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1798338Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1798428Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1798539Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1799016Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1799555Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.1800038Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1800642Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1801115Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1801656Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1802130Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1802603Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1803162Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1803638Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1803975Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:44.1804072Z Autotune Choices Stats: 2025-12-04T09:41:44.1804940Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:44.1805042Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1805130Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1805240Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1805729Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1806253Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1806724Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1807188Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1807764Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1808241Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1808718Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1809195Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1809667Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1810147Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1810479Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:44.1810708Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1810807Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1810942Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1811196Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1812609Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1812695Z graph_break [] 2025-12-04T09:41:44.1812803Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1812980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1813077Z Autotune Choices Stats: 2025-12-04T09:41:44.1813914Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1814007Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1814099Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1814206Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1814683Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1815159Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1815682Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1816167Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1816636Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1817150Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1817659Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1818149Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1818619Z 
triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1819092Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1819429Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:44.1819521Z Autotune Choices Stats: 2025-12-04T09:41:44.1820384Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:44.1820521Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1820606Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1820714Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1821196Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1821715Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1822190Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1822659Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1823134Z triton_mm_3108 0.0317 ms 96.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1823601Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1824078Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1824551Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1825066Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1825539Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1825864Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:44.1826082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1826178Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1826314Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1826563Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1827508Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1827598Z graph_break [] 2025-12-04T09:41:44.1827701Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1827877Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1827975Z Autotune Choices Stats: 2025-12-04T09:41:44.1828807Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1828904Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1829033Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1829138Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1829623Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1830102Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1830582Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1831097Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1831587Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1832064Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1832531Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1833003Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1833478Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1833952Z triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1834322Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.1834416Z Autotune Choices Stats: 2025-12-04T09:41:44.1835257Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1835388Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1835478Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1835584Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1836057Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1836545Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1837020Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1837503Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1838022Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1838493Z triton_mm_3145 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1839029Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1839547Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1840035Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1840550Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1840890Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:44.1840984Z Autotune Choices Stats: 2025-12-04T09:41:44.1841816Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.1841912Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1841997Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1842115Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1842593Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1843064Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1843541Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1844045Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1844518Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1845029Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1845500Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1845979Z triton_mm_3178 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1846452Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1846927Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1847263Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:44.1847439Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1847532Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1847729Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1848001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1849387Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1849475Z graph_break [] 2025-12-04T09:41:44.1849579Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1849790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1849890Z Autotune Choices Stats: 2025-12-04T09:41:44.1850746Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1850848Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1850936Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1851040Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1851522Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1852001Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1852481Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1852958Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1853472Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1853943Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1854457Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1854931Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1855399Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1855883Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1856215Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:44.1856308Z Autotune Choices Stats: 2025-12-04T09:41:44.1857158Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1857291Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1857380Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1857484Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1857964Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1858442Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1858956Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1859430Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:44.1859902Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1860375Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1860841Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1866140Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1866642Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1867131Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1867530Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:44.1867712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1867809Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1867941Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1868286Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1869663Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 
13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1869750Z graph_break [] 2025-12-04T09:41:44.1869859Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1870033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1870128Z Autotune Choices Stats: 2025-12-04T09:41:44.1870959Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1871055Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1871144Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1871250Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1871775Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1872261Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1872732Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1873248Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1873727Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1874208Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1874677Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1875144Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1875618Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1876081Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1876421Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:44.1876517Z Autotune Choices Stats: 2025-12-04T09:41:44.1877399Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1877493Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1877642Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1877754Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1878231Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1878714Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1879187Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1879771Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1880244Z triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1880712Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1881229Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1881704Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1882183Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1882694Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1883029Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:44.1883212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1883314Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1883449Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1883702Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1885073Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1885167Z graph_break [] 2025-12-04T09:41:44.1885271Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1885449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1885542Z Autotune Choices Stats: 2025-12-04T09:41:44.1886424Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.1886526Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1886611Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1886720Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1887208Z triton_mm_3300 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.1887728Z triton_mm_3288 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1888207Z triton_mm_3289 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1888680Z triton_mm_3294 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1889173Z triton_mm_3295 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1889646Z triton_mm_3296 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1890114Z triton_mm_3287 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1890624Z triton_mm_3290 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1891089Z triton_mm_3291 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1891561Z triton_mm_3292 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1891894Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6016 seconds 
precompiling for 15 choices 2025-12-04T09:41:44.1892028Z Autotune Choices Stats: 2025-12-04T09:41:44.1892863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028575999662280083, "best_triton_pos": 0} 2025-12-04T09:41:44.1892957Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1893046Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1893152Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1893631Z triton_mm_3318 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1894109Z triton_mm_3319 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1894577Z triton_mm_3320 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1895046Z triton_mm_3323 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1895551Z triton_mm_3317 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1896022Z triton_mm_3322 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1896487Z triton_mm_3321 0.0337 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1897002Z triton_mm_3324 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1897476Z triton_mm_3326 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1897987Z triton_mm_3327 0.0348 ms 82.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1898339Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8359 seconds precompiling for 13 choices 2025-12-04T09:41:44.1898510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1898609Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1898744Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1898993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1900553Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1900712Z graph_break [] 2025-12-04T09:41:44.1900824Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1901020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1901115Z Autotune Choices Stats: 2025-12-04T09:41:44.1902185Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3344", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.1902282Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1902379Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1902487Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1903063Z triton_mm_3344 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1903621Z triton_mm_3332 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1904176Z triton_mm_3333 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1904742Z triton_mm_3339 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1905298Z triton_mm_3341 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1905909Z triton_mm_3331 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1906471Z triton_mm_3338 0.0285 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1907023Z 
triton_mm_3330 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1907634Z triton_mm_3334 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1908238Z triton_mm_3335 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1908629Z SingleProcess AUTOTUNE benchmarking takes 0.2046 seconds and 0.6449 seconds precompiling for 15 choices 2025-12-04T09:41:44.1908723Z Autotune Choices Stats: 2025-12-04T09:41:44.1909716Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.1909819Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1909910Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1910016Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1910577Z triton_mm_3366 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1911170Z triton_mm_3360 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1911650Z triton_mm_3361 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1912114Z triton_mm_3362 0.0287 ms 96.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1912625Z triton_mm_3363 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1913101Z triton_mm_3365 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1913576Z triton_mm_3367 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1914046Z triton_mm_3364 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1914519Z triton_mm_3369 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1915005Z triton_mm_3370 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1915337Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.9040 seconds precompiling for 13 choices 2025-12-04T09:41:44.1915516Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1915611Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1915807Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1916057Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1916997Z inductor [('triton_bundler_save_kernel', 40), 
('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1917147Z graph_break [] 2025-12-04T09:41:44.1917252Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1917425Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1917522Z Autotune Choices Stats: 2025-12-04T09:41:44.1918360Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3382", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.1918455Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1918545Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1918650Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1919133Z triton_mm_3382 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1919656Z triton_mm_3374 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1920127Z triton_mm_3379 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1920652Z triton_mm_3380 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1921127Z triton_mm_3381 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1921611Z triton_mm_3384 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1922119Z triton_mm_3373 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1922592Z triton_mm_3375 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1923058Z triton_mm_3376 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1923523Z triton_mm_3377 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1923860Z SingleProcess AUTOTUNE benchmarking takes 0.2041 seconds and 0.6036 seconds precompiling for 15 choices 2025-12-04T09:41:44.1923953Z Autotune Choices Stats: 2025-12-04T09:41:44.1924779Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3406", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.1924872Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1924959Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1925067Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1925596Z triton_mm_3406 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1926078Z triton_mm_3404 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1926587Z triton_mm_3405 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1927055Z triton_mm_3409 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1927550Z triton_mm_3403 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1928046Z triton_mm_3408 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1928514Z triton_mm_3407 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1928989Z triton_mm_3410 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1929463Z triton_mm_3412 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1929973Z triton_mm_3413 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1930302Z SingleProcess AUTOTUNE 
benchmarking takes 0.1782 seconds and 1.8691 seconds precompiling for 13 choices 2025-12-04T09:41:44.1930398Z Autotune Choices Stats: 2025-12-04T09:41:44.1931271Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3437", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.1931372Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1931458Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1931572Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1932052Z triton_mm_3437 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1932525Z triton_mm_3438 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1933005Z triton_mm_3440 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1933484Z triton_mm_3442 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1933965Z triton_mm_3443 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1934435Z triton_mm_3430 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1934941Z triton_mm_3435 0.0278 ms 99.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1935415Z triton_mm_3429 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1935920Z triton_mm_3433 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1936396Z triton_mm_3436 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1936731Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6051 seconds precompiling for 15 choices 2025-12-04T09:41:44.1936905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1937002Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1937132Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1937381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1938381Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1938464Z graph_break [] 2025-12-04T09:41:44.1938613Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1938786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1938881Z Autotune Choices Stats: 2025-12-04T09:41:44.1939713Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3462", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1939805Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1939895Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1940002Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1940516Z triton_mm_3462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1941001Z triton_mm_3466 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1941482Z triton_mm_3467 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1941963Z triton_mm_3472 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1942432Z triton_mm_3459 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1942911Z triton_mm_3460 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1943380Z triton_mm_3461 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1943892Z triton_mm_3463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1944359Z triton_mm_3465 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1944831Z triton_mm_3468 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1945212Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.5791 seconds precompiling for 15 choices 2025-12-04T09:41:44.1945304Z Autotune Choices Stats: 2025-12-04T09:41:44.1946136Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3489", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.1946235Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1946320Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1946427Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1946900Z triton_mm_3489 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1947383Z triton_mm_3490 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1947857Z triton_mm_3491 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1948375Z triton_mm_3492 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1948851Z triton_mm_3495 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1949317Z triton_mm_3494 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1949851Z triton_mm_3493 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1950325Z triton_mm_3498 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1950813Z triton_mm_3496 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1951283Z triton_mm_3499 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1951613Z SingleProcess AUTOTUNE benchmarking takes 0.1783 seconds and 1.8202 seconds precompiling for 13 choices 2025-12-04T09:41:44.1951710Z Autotune Choices Stats: 2025-12-04T09:41:44.1952544Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3516", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1952643Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1952726Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1952837Z 
dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1953356Z triton_mm_3516 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1953831Z triton_mm_3517 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1954346Z triton_mm_3520 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1954823Z triton_mm_3522 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1955302Z triton_mm_3524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1955789Z triton_mm_3525 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1956256Z triton_mm_3515 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1956728Z triton_mm_3518 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1957195Z triton_mm_3519 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1957715Z triton_mm_3521 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1958086Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.1958259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1958358Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1958492Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1958744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1959781Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1959868Z graph_break [] 2025-12-04T09:41:44.1959976Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1960150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1960242Z Autotune Choices Stats: 2025-12-04T09:41:44.1961071Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3547", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1961166Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1961260Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1961365Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1961841Z triton_mm_3547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.1962367Z triton_mm_3552 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1962847Z triton_mm_3556 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1963318Z triton_mm_3548 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1963830Z triton_mm_3546 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1964310Z triton_mm_3555 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1964781Z triton_mm_3549 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1965247Z triton_mm_3550 0.0297 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1965717Z triton_mm_3545 0.0306 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1966183Z triton_mm_3551 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1966527Z SingleProcess AUTOTUNE benchmarking takes 0.2059 seconds and 0.6126 seconds 
precompiling for 15 choices 2025-12-04T09:41:44.1966659Z Autotune Choices Stats: 2025-12-04T09:41:44.1967516Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3577", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.1967632Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1967724Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1967847Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1968360Z triton_mm_3577 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1968835Z triton_mm_3576 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1969307Z triton_mm_3578 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1969774Z triton_mm_3581 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1970245Z triton_mm_3575 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1970716Z triton_mm_3580 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1971184Z triton_mm_3579 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1971699Z triton_mm_3584 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1972176Z triton_mm_3585 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1972650Z triton_mm_3582 0.0348 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1973022Z SingleProcess AUTOTUNE benchmarking takes 0.1807 seconds and 1.8175 seconds precompiling for 13 choices 2025-12-04T09:41:44.1973116Z Autotune Choices Stats: 2025-12-04T09:41:44.1973941Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3603", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1974045Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1974134Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1974244Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.1974725Z triton_mm_3603 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1975204Z triton_mm_3607 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1975689Z triton_mm_3612 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1976200Z triton_mm_3611 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1976668Z triton_mm_3606 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1977142Z triton_mm_3605 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1977654Z triton_mm_3608 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1978130Z triton_mm_3601 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1978602Z triton_mm_3602 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1979078Z triton_mm_3604 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1979411Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.1979587Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.1979684Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.1979815Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.1980060Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.1981490Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.1981577Z graph_break [] 2025-12-04T09:41:44.1981686Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.1981859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.1981990Z Autotune Choices Stats: 2025-12-04T09:41:44.1982841Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3638", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.1982939Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.1983029Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1983132Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.1983609Z triton_mm_3638 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1984089Z triton_mm_3639 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1984569Z triton_mm_3642 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.1985039Z triton_mm_3634 0.0286 ms 
96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1985583Z triton_mm_3636 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1986073Z triton_mm_3644 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1986538Z triton_mm_3631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1987050Z triton_mm_3633 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1987520Z triton_mm_3637 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1988029Z triton_mm_3640 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1988388Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6077 seconds precompiling for 15 choices 2025-12-04T09:41:44.1988479Z Autotune Choices Stats: 2025-12-04T09:41:44.1989308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3667", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.1989410Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.1989495Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.1989603Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.1990076Z triton_mm_3667 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.1990585Z triton_mm_3664 0.0286 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1991063Z triton_mm_3662 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1991533Z triton_mm_3663 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.1992049Z triton_mm_3661 0.0297 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.1992519Z triton_mm_3666 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1992994Z triton_mm_3665 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.1993466Z triton_mm_3670 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1993949Z triton_mm_3671 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.1994420Z triton_mm_3668 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.1994790Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8616 seconds precompiling for 13 choices 2025-12-04T09:41:44.1995015Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:44.1995121Z Traceback (most recent call last): 2025-12-04T09:41:44.1995542Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.1995734Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:44.1996078Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:44.1996270Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:44.1996470Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.1996555Z Searched string: 2025-12-04T09:41:44.1996692Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.1996702Z 2025-12-04T09:41:44.1996818Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.1996823Z 2025-12-04T09:41:44.1996954Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.1997078Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.1997082Z 2025-12-04T09:41:44.1997173Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.1997267Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.1997362Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1997453Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.1997457Z 2025-12-04T09:41:44.1997549Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.1997649Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.1997759Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1997870Z b = tl.load(B + (xindex)) 
2025-12-04T09:41:44.1997876Z 2025-12-04T09:41:44.1997881Z 2025-12-04T09:41:44.1998041Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.1998049Z 2025-12-04T09:41:44.1998052Z 2025-12-04T09:41:44.1998175Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.1998290Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.1998446Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.1998539Z idx_m = rm[:, None] 2025-12-04T09:41:44.1998622Z idx_n = rn[None, :] 2025-12-04T09:41:44.1998713Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.1998718Z 2025-12-04T09:41:44.1998816Z # inductor generates a suffix 2025-12-04T09:41:44.1998907Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.1999162Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.1999256Z ''', device_str='cuda') 2025-12-04T09:41:44.1999263Z 2025-12-04T09:41:44.1999267Z 2025-12-04T09:41:44.1999368Z async_compile.wait(globals()) 2025-12-04T09:41:44.1999456Z del async_compile 2025-12-04T09:41:44.1999461Z 2025-12-04T09:41:44.1999594Z class Runner: 2025-12-04T09:41:44.1999698Z def __init__(self, partitions): 2025-12-04T09:41:44.1999808Z self.partitions = partitions 2025-12-04T09:41:44.1999811Z 2025-12-04T09:41:44.1999917Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.2000009Z new_callables = [] 2025-12-04T09:41:44.2000138Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.2000380Z new_callables.append(fn(c)) 2025-12-04T09:41:44.2000484Z self.partitions = new_callables 2025-12-04T09:41:44.2000492Z 2025-12-04T09:41:44.2000582Z def call(self, args): 2025-12-04T09:41:44.2000669Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.2000761Z args.clear() 2025-12-04T09:41:44.2000888Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.2001015Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.2001128Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.2001224Z 
torch.cuda.set_device(0) 2025-12-04T09:41:44.2001462Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.2001685Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.2001782Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.2001984Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.2002067Z del arg0_1 2025-12-04T09:41:44.2002230Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.2002484Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.2002584Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.2002860Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.2002947Z del arg1_1 2025-12-04T09:41:44.2003025Z del buf0 2025-12-04T09:41:44.2003108Z return (buf1, ) 2025-12-04T09:41:44.2003118Z 2025-12-04T09:41:44.2003218Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.2003299Z call = runner.call 2025-12-04T09:41:44.2003459Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.2003463Z 2025-12-04T09:41:44.2003467Z 2025-12-04T09:41:44.2003607Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.2003735Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.2003885Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.2004085Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.2004291Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.2004389Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.2004555Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.2004560Z 2025-12-04T09:41:44.2004563Z 2025-12-04T09:41:44.2004654Z if __name__ == "__main__": 2025-12-04T09:41:44.2004857Z from torch._inductor.wrapper_benchmark import 
compiled_module_main 2025-12-04T09:41:44.2005014Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.2005102Z From CHECK: .to( 2025-12-04T09:41:44.2005106Z 2025-12-04T09:41:44.2005110Z 2025-12-04T09:41:44.2005345Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.2005897Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.2005903Z 2025-12-04T09:41:44.2006175Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.2006355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2006452Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2006581Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2006836Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2008222Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2008309Z graph_break [] 2025-12-04T09:41:44.2008412Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2008589Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2008686Z Autotune Choices Stats: 2025-12-04T09:41:44.2009527Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2009661Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2009753Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2009861Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2010353Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2010814Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2011314Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2011775Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2012235Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2012708Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2013166Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2013627Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2014081Z triton_mm_21 0.0287 
ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2014629Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2014968Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.2015058Z Autotune Choices Stats: 2025-12-04T09:41:44.2015889Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2016048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2016133Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2016244Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2016717Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2017181Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2017635Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2018146Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2018603Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2019098Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2019557Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2020022Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2020528Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2020992Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2021326Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:44.2021504Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2021597Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2021730Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2021978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2022921Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T09:41:44.2023005Z graph_break [] 2025-12-04T09:41:44.2023107Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2023286Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2023378Z Autotune Choices Stats: 2025-12-04T09:41:44.2024245Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2024343Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2024427Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2024571Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2025050Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2025521Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2025999Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2026474Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2026952Z triton_mm_804 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2027437Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2027934Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2028463Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2028929Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2029391Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2029763Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:44.2029858Z Autotune Choices Stats: 2025-12-04T09:41:44.2030683Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2030780Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2030873Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2030976Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2031441Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2031914Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.2032382Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2032849Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2033347Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2033822Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2034323Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2034790Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2035262Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2035730Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2036059Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:44.2036151Z Autotune Choices Stats: 2025-12-04T09:41:44.2036989Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.2037121Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2037205Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2037318Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2037795Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2038250Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2038716Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2039220Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2039741Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2040214Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2040684Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2041158Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2041625Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2042086Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2042459Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:44.2042639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2042731Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2042860Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2043146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2044081Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2044171Z graph_break [] 2025-12-04T09:41:44.2044275Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2044448Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2044546Z Autotune Choices Stats: 2025-12-04T09:41:44.2045377Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.2045475Z 
AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2045558Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2045666Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2046151Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2046656Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2047118Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2047600Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2048129Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2048601Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2049070Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2049547Z triton_mm_889 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2050007Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2050474Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2050804Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:44.2050896Z Autotune Choices Stats: 2025-12-04T09:41:44.2051781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:44.2051875Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2051964Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2052069Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2052536Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2053047Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2053510Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2053977Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2054440Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=8 2025-12-04T09:41:44.2054907Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2055369Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2055875Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2056349Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2056815Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2057147Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:44.2057275Z Autotune Choices Stats: 2025-12-04T09:41:44.2058169Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2058267Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2058351Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2058465Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2058933Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.2059402Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2059882Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2060343Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2060853Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2061312Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2061772Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2062279Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2062742Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2063212Z triton_mm_942 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2063538Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling 
for 15 choices 2025-12-04T09:41:44.2063715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2063812Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2063943Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2064193Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2065569Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2065697Z graph_break [] 2025-12-04T09:41:44.2065801Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2065974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2066070Z Autotune Choices Stats: 2025-12-04T09:41:44.2066948Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2067046Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2067135Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2067240Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2067730Z triton_mm_979 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2068189Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2068653Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2069118Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2069579Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2070091Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2070557Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2071032Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2071541Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2072007Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2072339Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:44.2072430Z Autotune Choices Stats: 2025-12-04T09:41:44.2073261Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2073356Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2073445Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2073555Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2074032Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2074542Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2075020Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2075479Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2075980Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2076442Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2076918Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2077391Z triton_mm_1004 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2077915Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2078389Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2078721Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:44.2078894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2078988Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2079163Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2079410Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2080411Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2080534Z graph_break [] 2025-12-04T09:41:44.2080640Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2080817Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2080909Z Autotune Choices Stats: 2025-12-04T09:41:44.2081749Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, 
"best_triton_pos": 0} 2025-12-04T09:41:44.2081845Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2081930Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2082038Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2082516Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2082989Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2083459Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2083966Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2084434Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2084898Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2085433Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2085905Z triton_mm_1017 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2086379Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2086853Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2087186Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:44.2087282Z Autotune Choices Stats: 2025-12-04T09:41:44.2088165Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2088263Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2088348Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2088451Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2088970Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2089443Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2089959Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2090425Z triton_mm_1044 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2090894Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2091361Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2091826Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2092301Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2092771Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2093289Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2093616Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:44.2093706Z Autotune Choices Stats: 2025-12-04T09:41:44.2094586Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2094682Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2094775Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2094884Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2095364Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2095844Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2096310Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2096782Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2097254Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2097718Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2098222Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2098692Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2099207Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2099676Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2100011Z SingleProcess AUTOTUNE benchmarking 
takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:44.2100185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2100425Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2100561Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2100807Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2101750Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2101836Z graph_break [] 2025-12-04T09:41:44.2101937Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2102181Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2102272Z Autotune Choices Stats: 2025-12-04T09:41:44.2103115Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2103214Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2103300Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2103410Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2103939Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2104416Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.2104895Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2105371Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2105851Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2106330Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2106802Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2107335Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2107916Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2108503Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2108923Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.2109023Z Autotune Choices Stats: 2025-12-04T09:41:44.2109851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2109948Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2110037Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2110140Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2110618Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2111100Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2111573Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2112102Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2112573Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2113039Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2113544Z triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2114021Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2114503Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2114975Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2115305Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:44.2115399Z Autotune Choices Stats: 2025-12-04T09:41:44.2116235Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2116331Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2116417Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2116531Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2117042Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2117531Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2118087Z triton_mm_1156 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2118562Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2119043Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2119578Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2120065Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2120533Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2121005Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2121537Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2121867Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:44.2122043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2122136Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2122268Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2122518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2123927Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2124018Z graph_break [] 2025-12-04T09:41:44.2124122Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2124302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2124394Z Autotune Choices Stats: 2025-12-04T09:41:44.2125236Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2125334Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2125421Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2130383Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2130912Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2131460Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2131936Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2132464Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2132941Z triton_mm_1189 0.0276 ms 
100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2133421Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2133907Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2134388Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2134875Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2135345Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2135727Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:44.2135822Z Autotune Choices Stats: 2025-12-04T09:41:44.2136680Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2136782Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2136866Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2136974Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2137511Z triton_mm_1211 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2138020Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2138490Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2138961Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2139434Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2139900Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2140369Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2140882Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2141356Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2141869Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2142204Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:44.2142388Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2142483Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2142614Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2142868Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2144250Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2144340Z graph_break [] 2025-12-04T09:41:44.2144445Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2144620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2144756Z Autotune Choices Stats: 2025-12-04T09:41:44.2145599Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2145694Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2145779Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2145883Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2146413Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2146889Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2147360Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2147854Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2148343Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2148817Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2149287Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2149759Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2150267Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2150741Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2151117Z SingleProcess AUTOTUNE benchmarking 
takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:44.2151213Z Autotune Choices Stats: 2025-12-04T09:41:44.2152048Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2152144Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2152233Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2152339Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2152816Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2153291Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2153767Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2154284Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2154764Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2155228Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2155700Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2156212Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2156688Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2157160Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2157493Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:44.2157666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2157764Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2157899Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2158144Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2159628Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2159721Z graph_break [] 2025-12-04T09:41:44.2159824Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2159999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2160129Z Autotune 
Choices Stats: 2025-12-04T09:41:44.2160964Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.2161060Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2161147Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2161256Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2161735Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2162206Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2162680Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2163157Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2163695Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2164174Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2164651Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.2165212Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2165682Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2166148Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2166483Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:44.2166580Z Autotune Choices Stats: 2025-12-04T09:41:44.2167415Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2167524Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2167628Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2167750Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2168242Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2168757Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2169233Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.2169703Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2170209Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2170675Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2171142Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2171616Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2172086Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2172563Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2172932Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:44.2173103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2173199Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2173331Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2173576Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:41:44.2174986Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2175074Z graph_break [] 2025-12-04T09:41:44.2175186Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2175358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2175451Z Autotune Choices Stats: 2025-12-04T09:41:44.2176287Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2176382Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2176476Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2176579Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2177056Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2177533Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2178097Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2178582Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2179057Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2179581Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2180064Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2180545Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2181020Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2181487Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2181822Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:44.2181913Z Autotune Choices Stats: 2025-12-04T09:41:44.2182798Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.2182896Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2182981Z strides: [256, 
1], [256, 1] 2025-12-04T09:41:44.2183089Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2183560Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2184065Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2184532Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2185007Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2185478Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2185946Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2186431Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2186904Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2187420Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2187891Z 
triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2188223Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:44.2188439Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2188536Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2188667Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2188921Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2190307Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2190395Z graph_break [] 2025-12-04T09:41:44.2190503Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2190688Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2190783Z Autotune Choices Stats: 2025-12-04T09:41:44.2191631Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2191769Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2191854Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2191959Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2192451Z 
triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2192922Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2193429Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2193900Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2194373Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2194839Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2195308Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2195786Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2196255Z triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2196775Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2197108Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:44.2197199Z Autotune Choices Stats: 2025-12-04T09:41:44.2198097Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2198258Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2198345Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2198452Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2198933Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2199409Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2199927Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2200601Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2201153Z triton_mm_1385 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2201707Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2202175Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2202645Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2203178Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2203651Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2203986Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:44.2204161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2204256Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2204388Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2204633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2205579Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2205661Z graph_break [] 2025-12-04T09:41:44.2205764Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2205942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2206033Z Autotune Choices Stats: 
2025-12-04T09:41:44.2206912Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.2207007Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2207093Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2207260Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2207761Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2208257Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2208730Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2209197Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2209668Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2210141Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2210618Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.2211138Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2211615Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2212084Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2212451Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:44.2212551Z Autotune Choices Stats: 2025-12-04T09:41:44.2213383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2213484Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2213571Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2213674Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2214152Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2214624Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2215092Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2215557Z 
triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2216061Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2216531Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2217048Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2217523Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2218003Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2218476Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2218807Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:44.2218901Z Autotune Choices Stats: 2025-12-04T09:41:44.2219736Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2219870Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.2219961Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2220072Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2220545Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2221020Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2221530Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2222014Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2222491Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2222971Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2223447Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2223928Z triton_mm_1462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2224413Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2224933Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2225264Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:44.2225435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2225529Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2225705Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2225951Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2226899Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2226985Z graph_break [] 2025-12-04T09:41:44.2227089Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2227267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2227359Z Autotune Choices Stats: 2025-12-04T09:41:44.2228257Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2228353Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2228441Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2228550Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2229029Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2229551Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2230027Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2230503Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2231017Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2231490Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2231964Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2232427Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2232893Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2233361Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:44.2233690Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.2233788Z Autotune Choices Stats: 2025-12-04T09:41:44.2234675Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2234780Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2234866Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2234970Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2235488Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2235959Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2236434Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2236904Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2237414Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2237888Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2238353Z 
triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2238869Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2239340Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2239862Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2240233Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:44.2240328Z Autotune Choices Stats: 2025-12-04T09:41:44.2241163Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2241259Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2241351Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2241463Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2241940Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2242419Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2242891Z triton_mm_1543 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2243372Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2243885Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2244361Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2244885Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2245359Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2245830Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2246299Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2246632Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:44.2246808Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2246902Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2247036Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T09:41:44.2247280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2248262Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2248348Z graph_break [] 2025-12-04T09:41:44.2248453Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2248632Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2248723Z Autotune Choices Stats: 2025-12-04T09:41:44.2249595Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2249695Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2249781Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2249889Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2250367Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2250843Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2251320Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2251801Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2252286Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2252800Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2253271Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2253746Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2254252Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2254720Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2255051Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.2255153Z Autotune Choices Stats: 2025-12-04T09:41:44.2255981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.2256080Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2256169Z strides: [256, 1], 
[256, 1] 2025-12-04T09:41:44.2256275Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2256746Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2257293Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2257772Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2258243Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2258745Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2259214Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2259681Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2260157Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2260628Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2261109Z 
triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2261432Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:44.2261526Z Autotune Choices Stats: 2025-12-04T09:41:44.2262410Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2262503Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2262589Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2262700Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2263178Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2263696Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2264260Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2264733Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2265262Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2265765Z triton_mm_1629 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2266279Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2266830Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2267348Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2267832Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2268162Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:44.2268480Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2268577Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2268709Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2268962Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2270332Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2270422Z 
graph_break [] 2025-12-04T09:41:44.2270527Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2270702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2270797Z Autotune Choices Stats: 2025-12-04T09:41:44.2271664Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2271768Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2271899Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2272004Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2272478Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2272989Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2273470Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2273943Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2274412Z triton_mm_1659 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2274883Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2275359Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2275833Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2276343Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2276819Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2277151Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:44.2277246Z Autotune Choices Stats: 2025-12-04T09:41:44.2278127Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2278225Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2278320Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2278424Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2278901Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2279372Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.2279885Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2280400Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2280870Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2281382Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2281851Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2282358Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2282834Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2283307Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2283639Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:44.2283813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2283909Z frames [('total', 1), 
('ok', 1)] 2025-12-04T09:41:44.2284042Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2284291Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2285662Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2285788Z graph_break [] 2025-12-04T09:41:44.2285893Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2286071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2286164Z Autotune Choices Stats: 2025-12-04T09:41:44.2287000Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.2287158Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2287249Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2287378Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2287858Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2288333Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2288805Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2289270Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2289748Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2290221Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2290736Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2291207Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2291717Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2292191Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2292525Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 seconds precompiling for 15 choices 2025-12-04T09:41:44.2292621Z Autotune Choices Stats: 2025-12-04T09:41:44.2293447Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2293547Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2293633Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2293740Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2294218Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2294684Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2295192Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2295662Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2296129Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2296637Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2297103Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2297604Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.2298103Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2298576Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2298908Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.2299080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2299180Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2299310Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2299556Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2301248Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2301391Z graph_break [] 2025-12-04T09:41:44.2301500Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2301675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2301768Z Autotune Choices Stats: 2025-12-04T09:41:44.2302612Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:44.2302706Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2302796Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2302901Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2303375Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2303856Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2304334Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2304904Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2305385Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2305858Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2306387Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2306858Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2307369Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2307850Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2308184Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:44.2308280Z Autotune Choices Stats: 2025-12-04T09:41:44.2309113Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2309213Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2309299Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2309407Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2309922Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2310390Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2310903Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2311368Z triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2311841Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2312307Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2312772Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2313245Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2313723Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2314287Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2314613Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:44.2314788Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2314883Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2315016Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2315306Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2316679Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), 
('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2316770Z graph_break [] 2025-12-04T09:41:44.2316875Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2317050Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2317146Z Autotune Choices Stats: 2025-12-04T09:41:44.2318027Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2318124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2318210Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2318316Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2318791Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2319310Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2319834Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2320353Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2320829Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2321312Z 
triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2321776Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2322248Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2322716Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2323188Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2323554Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.2323649Z Autotune Choices Stats: 2025-12-04T09:41:44.2324486Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2324580Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2324671Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2324812Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2325290Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2325767Z triton_mm_1814 0.0287 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2326238Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2326712Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2327193Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2327701Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2328166Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2328676Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2329152Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2329662Z triton_mm_1822 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2329996Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:44.2330171Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2330264Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2330397Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2330645Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2331588Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2331674Z graph_break [] 2025-12-04T09:41:44.2331780Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2331957Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2332050Z Autotune Choices Stats: 2025-12-04T09:41:44.2332939Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.2333033Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2333118Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2333223Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2333701Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2334209Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2334676Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2335156Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2335626Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2336102Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2336575Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2337048Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2337555Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2338026Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2338359Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 seconds precompiling for 15 choices 2025-12-04T09:41:44.2338520Z Autotune Choices Stats: 2025-12-04T09:41:44.2339355Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2339455Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2339540Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2339644Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2340119Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2340589Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2341060Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2341525Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2342030Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2342503Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2342974Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2343488Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2343961Z 
triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2344440Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2344767Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:44.2344857Z Autotune Choices Stats: 2025-12-04T09:41:44.2345694Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2345789Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2345879Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2345990Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2346462Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2346977Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2347480Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2348019Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2348496Z triton_mm_1889 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2348976Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2349455Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2349945Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2350424Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2350893Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2351265Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:44.2351440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2351535Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2351668Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2351913Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2353327Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2353416Z graph_break [] 2025-12-04T09:41:44.2353521Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2353700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2353793Z Autotune Choices Stats: 2025-12-04T09:41:44.2354642Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.2354737Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2354825Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2354938Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2355416Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2355895Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2356404Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2356879Z triton_mm_1922 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2357402Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2357888Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2358387Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2358854Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2359331Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2359939Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2360325Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:44.2360468Z Autotune Choices Stats: 2025-12-04T09:41:44.2361469Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2361568Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2361655Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2361762Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2362332Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2362932Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2363496Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2364049Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2364599Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2365155Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2365703Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2366266Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2366857Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2367417Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2367848Z SingleProcess AUTOTUNE 
benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:44.2368044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2368144Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2368282Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2368570Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2370285Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2370371Z graph_break [] 2025-12-04T09:41:44.2370482Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2370678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2370775Z Autotune Choices Stats: 2025-12-04T09:41:44.2371773Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2371908Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2372006Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2372113Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2372682Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2373287Z triton_mm_1963 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2373839Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2374402Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2374968Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2375521Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2376078Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2376631Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2377182Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2377845Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2378235Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:44.2378328Z Autotune 
Choices Stats: 2025-12-04T09:41:44.2379373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2379469Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2379561Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2379677Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2380245Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2380801Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2381350Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2381902Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2382497Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2383049Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2383601Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.2384159Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2384755Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2385315Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2385704Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:44.2385902Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2385999Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2386142Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2386426Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2388247Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2388338Z graph_break [] 2025-12-04T09:41:44.2388444Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2388687Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2388781Z Autotune Choices Stats: 2025-12-04T09:41:44.2389773Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2389915Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2390003Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2390113Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2390673Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2391242Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2391810Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2392372Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2392947Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2393536Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2394093Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2394652Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2395247Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2395803Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2396234Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:44.2396330Z Autotune Choices Stats: 2025-12-04T09:41:44.2397332Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2397428Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2397523Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2397632Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2398201Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2398760Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2399357Z triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2399896Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2400498Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2401039Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2401508Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2401986Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2402464Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2402933Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2403269Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:44.2403440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2403604Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2403732Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2403977Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2405404Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2405489Z graph_break [] 2025-12-04T09:41:44.2405659Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2405834Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2405925Z Autotune Choices Stats: 2025-12-04T09:41:44.2406768Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2406859Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2406945Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2407049Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2407536Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2408049Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2408512Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2409040Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.2409509Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2409978Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2410490Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2410961Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2411440Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2411916Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2412249Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:44.2412341Z Autotune Choices Stats: 2025-12-04T09:41:44.2413195Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2413336Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2413421Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2413526Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2414001Z 
triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2414476Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2415020Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2420501Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2421009Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2421482Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2421953Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2422434Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2422909Z triton_mm_2080 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2423451Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2423796Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:44.2423975Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2424075Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2424256Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2424507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2425447Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2425540Z graph_break [] 2025-12-04T09:41:44.2425647Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2425828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2425923Z Autotune Choices Stats: 2025-12-04T09:41:44.2426755Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2426858Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2426947Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2427060Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2427538Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2428072Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2428545Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2429018Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2429541Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2430079Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2430561Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2431031Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2431503Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2431975Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2432310Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:44.2432414Z Autotune Choices Stats: 
2025-12-04T09:41:44.2433280Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2433376Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2433468Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2433614Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2434086Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2434559Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2435026Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2435501Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2435965Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2436437Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2436899Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.2437415Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2437937Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2438409Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2438786Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:44.2438882Z Autotune Choices Stats: 2025-12-04T09:41:44.2439784Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2439885Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2439973Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2440090Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2440563Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2441055Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2441525Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2441995Z 
triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2442516Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2442991Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2443513Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2443988Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2444479Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2444949Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2445278Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:44.2445458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2445557Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2445693Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2445941Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2447412Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2447504Z graph_break [] 2025-12-04T09:41:44.2447634Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2447839Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2447932Z Autotune Choices Stats: 2025-12-04T09:41:44.2448817Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2448920Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2449008Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2449118Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2449591Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2450065Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2450541Z triton_mm_2175 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2451011Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2451485Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2451996Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2452473Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2452985Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2453458Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2453933Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2454266Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:44.2454361Z Autotune Choices Stats: 2025-12-04T09:41:44.2455196Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2455295Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2455390Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2455494Z 
dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2456042Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2456513Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2456984Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2457497Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2457964Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2458432Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2458902Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2459378Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2459853Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2460324Z triton_mm_2209 0.0348 ms 82.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2460658Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:44.2460881Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2460984Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2461116Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2461362Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2462308Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2462431Z graph_break [] 2025-12-04T09:41:44.2462540Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2462717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2462810Z Autotune Choices Stats: 2025-12-04T09:41:44.2463639Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2463736Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2463823Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2463933Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2464410Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.2464889Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2465409Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2465881Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2466357Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2466875Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2467362Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2467876Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2468371Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2468835Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2469172Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds 
precompiling for 15 choices 2025-12-04T09:41:44.2469265Z Autotune Choices Stats: 2025-12-04T09:41:44.2470091Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2470232Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2470320Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2470424Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2470899Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2471453Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2471919Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2472395Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2472865Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2473327Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2473795Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2474279Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2474792Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2475267Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2475595Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.2475692Z Autotune Choices Stats: 2025-12-04T09:41:44.2476555Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:44.2476654Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2476744Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2476859Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2477334Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2477806Z triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2478327Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2478798Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2479266Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2479892Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2480361Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2480873Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2481351Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2481825Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2482158Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.2482378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2482479Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2482611Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2482861Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2484241Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2484368Z graph_break [] 2025-12-04T09:41:44.2484479Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2484656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2484751Z Autotune Choices Stats: 2025-12-04T09:41:44.2485619Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2485717Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2485806Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2485919Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2486397Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2486884Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2487356Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2487838Z triton_mm_2299 0.0285 ms 96.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2488306Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2488775Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2489289Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2489759Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2490303Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2490770Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2491111Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:44.2491207Z Autotune Choices Stats: 2025-12-04T09:41:44.2492050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2492152Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.2492244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2492351Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2492836Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2493309Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2493824Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2494291Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2494762Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2495270Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2495741Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2496221Z triton_mm_2338 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2496691Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.2497171Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2497506Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:44.2497710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2497833Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2497972Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2498269Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2499648Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2499780Z graph_break [] 2025-12-04T09:41:44.2499888Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2500062Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2500163Z Autotune Choices Stats: 2025-12-04T09:41:44.2501146Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2501246Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2501336Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2501447Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.2501921Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2502402Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2502882Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2503434Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2503904Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2504374Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2504904Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2505375Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2505848Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2506327Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2506658Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:44.2506756Z Autotune Choices Stats: 2025-12-04T09:41:44.2507608Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2507719Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2507822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2507992Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2508471Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2508949Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2509470Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2509940Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2510415Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2510878Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2511344Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2511822Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2512298Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2512811Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2513144Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:44.2513321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2513420Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2513558Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2513841Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2515220Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2515309Z graph_break [] 2025-12-04T09:41:44.2515413Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2515594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2515690Z Autotune Choices Stats: 2025-12-04T09:41:44.2516545Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.2516644Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2516735Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2516847Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2517358Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2517826Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2518300Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2518810Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2519282Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2519801Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2520280Z 
triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2520752Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2521231Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2521748Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2522082Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:44.2522179Z Autotune Choices Stats: 2025-12-04T09:41:44.2523014Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2523118Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2523247Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2523354Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2523837Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2524316Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2524788Z triton_mm_2414 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2525258Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2525727Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2526197Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2526729Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2527204Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2527677Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2528269Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2528604Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:44.2528783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2528882Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2529018Z stats [('calls_captured', 2), ('unique_graphs', 
1)] 2025-12-04T09:41:44.2529263Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2530208Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2530298Z graph_break [] 2025-12-04T09:41:44.2530412Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2530586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2530724Z Autotune Choices Stats: 2025-12-04T09:41:44.2531552Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2531650Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2531743Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2531847Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2532321Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2532836Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2533310Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2533792Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2534265Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2534738Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2535214Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2535683Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2536196Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2536662Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2537035Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.2537129Z Autotune Choices Stats: 2025-12-04T09:41:44.2538020Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2538124Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2538212Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.2538318Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2538786Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2539257Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2539729Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2540199Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2540710Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2541175Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2541643Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2542209Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2542680Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2543161Z triton_mm_2465 
0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2543491Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:44.2543594Z Autotune Choices Stats: 2025-12-04T09:41:44.2544419Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2544518Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2544610Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2544727Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2545205Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2545709Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2546175Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2546688Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2547152Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2547620Z triton_mm_2489 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2548092Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2548558Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2549030Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2549497Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2549871Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:44.2550045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2550146Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2550275Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2550518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2551497Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2551581Z graph_break [] 2025-12-04T09:41:44.2551685Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2551858Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:44.2551947Z Autotune Choices Stats: 2025-12-04T09:41:44.2552787Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.2552880Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2552965Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2553073Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2553550Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2554028Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2554534Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2555006Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2555526Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2556041Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2556507Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2556979Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2557438Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2557945Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2558290Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:44.2558386Z Autotune Choices Stats: 2025-12-04T09:41:44.2559212Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2559347Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2559438Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2559593Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2560063Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2560606Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2561081Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2561557Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2562077Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2562546Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2563011Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2563484Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2564000Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2564474Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2564807Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:44.2564937Z Autotune Choices Stats: 2025-12-04T09:41:44.2565785Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, 
num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2565878Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2565965Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2566077Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2566547Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2567023Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2567500Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2568028Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2568556Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2569029Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2569500Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2570015Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2570494Z triton_mm_2570 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2571007Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2571336Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:44.2571509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2571601Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2571735Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2571979Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2573383Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2573516Z graph_break [] 2025-12-04T09:41:44.2573620Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2573795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2573886Z Autotune Choices Stats: 2025-12-04T09:41:44.2574715Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.2574854Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.2574939Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2575048Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2575523Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2575990Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2576466Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2576940Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2577414Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2577930Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2578398Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2578867Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2579370Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2579845Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2580177Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:44.2580274Z Autotune Choices Stats: 2025-12-04T09:41:44.2581104Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2581198Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2581287Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2581391Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2581867Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2582335Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2582842Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2583313Z triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2583819Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:44.2584295Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2584769Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2585244Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2585712Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2586183Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2586512Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:44.2586722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2586820Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2586949Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2587198Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2588191Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2588276Z graph_break [] 2025-12-04T09:41:44.2588382Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2588592Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2588684Z Autotune Choices Stats: 2025-12-04T09:41:44.2589528Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2589625Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2589712Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2589821Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2590295Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2590783Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.2591441Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2591924Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2592442Z triton_mm_2642 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2592905Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.2593414Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2593875Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2594347Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2594816Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2595150Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.2595245Z Autotune Choices Stats: 2025-12-04T09:41:44.2596070Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:44.2596231Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2596317Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2596419Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2596897Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2597481Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2598074Z 
triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2598543Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2599012Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2599531Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2599993Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2600609Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2601076Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2601548Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2601952Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:44.2602048Z Autotune Choices Stats: 2025-12-04T09:41:44.2602945Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2603166Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2603257Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2603365Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2603835Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2604324Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2604786Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2605252Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2605724Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2606268Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2606734Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2607201Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2607721Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2608184Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2608514Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:44.2608688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2608785Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2608913Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2609156Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2610559Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2610646Z graph_break [] 2025-12-04T09:41:44.2610754Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2610926Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2611015Z Autotune Choices Stats: 2025-12-04T09:41:44.2611905Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2611999Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2612123Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2612232Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2612710Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2613177Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2613641Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2614112Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2614584Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2615058Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2615572Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2616042Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2616513Z triton_mm_2733 0.0286 ms 93.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2617017Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2617356Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:44.2617446Z Autotune Choices Stats: 2025-12-04T09:41:44.2618336Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2618434Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2618518Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2618621Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2619100Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2619633Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2620112Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2620676Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2621143Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2621605Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2622112Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2622588Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2623110Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2623582Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2623906Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:44.2624083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2624176Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2624306Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2624552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2625534Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2625615Z graph_break [] 2025-12-04T09:41:44.2625719Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2625891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2625988Z Autotune Choices Stats: 2025-12-04T09:41:44.2626862Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2626958Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2627046Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2627150Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2627665Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2628175Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2628638Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2629104Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2629568Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2630079Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2630552Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2631097Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2631560Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2632023Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2632360Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:44.2632451Z Autotune Choices Stats: 2025-12-04T09:41:44.2633286Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2633383Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2633467Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2633572Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2634044Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2634565Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2635035Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2635498Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2636004Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2636469Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2636939Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2637406Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2637884Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2638361Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2638691Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:44.2638787Z Autotune Choices Stats: 2025-12-04T09:41:44.2639715Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.2639814Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2639900Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2640048Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2640521Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2640983Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2641455Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2641922Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2642395Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2642871Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2643341Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2643856Z triton_mm_2834 0.0286 ms 
96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2644322Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2644788Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2645155Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:44.2645329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2645428Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2645561Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2645812Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2647188Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2647273Z graph_break [] 2025-12-04T09:41:44.2647379Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2647560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2647670Z Autotune Choices Stats: 2025-12-04T09:41:44.2648574Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2648668Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2648758Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2648862Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2649343Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2649853Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2650322Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2650803Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2651271Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2651742Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2652217Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2652690Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2653197Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2653659Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2653994Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.2654085Z Autotune Choices Stats: 2025-12-04T09:41:44.2654972Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2655068Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2655151Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2655257Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2655732Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2656203Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2656674Z triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2657137Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=8 2025-12-04T09:41:44.2657615Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2658156Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2658623Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2659131Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2659608Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2660080Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2660408Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:44.2660588Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2660681Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2660818Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2661062Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2662005Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2662131Z graph_break [] 2025-12-04T09:41:44.2662233Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2662410Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2662501Z Autotune Choices Stats: 2025-12-04T09:41:44.2663382Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2663480Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2663605Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2663710Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2664199Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2664665Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2665137Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2665605Z triton_mm_2909 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2666077Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.2666550Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2667080Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2667557Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2668020Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2668527Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2668860Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:44.2668952Z Autotune Choices Stats: 2025-12-04T09:41:44.2669778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2669871Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2669958Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2670065Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2670536Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.2671009Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2671522Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2671991Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2672456Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2672961Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2673424Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2673896Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2674368Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2674834Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2675168Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds 
precompiling for 13 choices 2025-12-04T09:41:44.2675258Z Autotune Choices Stats: 2025-12-04T09:41:44.2676103Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.2676236Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2676322Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2676435Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2676909Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2677415Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2677929Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2678399Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2678876Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2679347Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2679877Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2680348Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2680867Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2681332Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2681661Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:44.2681841Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2681936Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2682109Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2682355Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2683297Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2683384Z graph_break [] 2025-12-04T09:41:44.2683486Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2683656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2683748Z Autotune Choices Stats: 2025-12-04T09:41:44.2684578Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2684676Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2684759Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2684864Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2685387Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2685860Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2686335Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2686838Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2687307Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2687810Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2688287Z triton_mm_2988 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2688751Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:44.2689217Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2689688Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2690056Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:44.2690149Z Autotune Choices Stats: 2025-12-04T09:41:44.2690982Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2691077Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2691165Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2691307Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2691774Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2692254Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2692724Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2693188Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.2693652Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2694117Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2694579Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2695089Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2695566Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2696074Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2696406Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:44.2696498Z Autotune Choices Stats: 2025-12-04T09:41:44.2697382Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:44.2697476Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2702497Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2702633Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2703156Z 
triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2703644Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2704236Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2704718Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2705187Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2705745Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2706220Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2706691Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2707170Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2707664Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2708022Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:44.2708203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2708300Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2708434Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2708682Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2710126Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2710262Z graph_break [] 2025-12-04T09:41:44.2710367Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2710547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2710642Z Autotune Choices Stats: 2025-12-04T09:41:44.2711482Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2711580Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2711671Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2711781Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2712257Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2712738Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2713218Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2713744Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2714216Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2714689Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2715202Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2715670Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2716142Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2716614Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2716950Z 
SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:44.2717052Z Autotune Choices Stats: 2025-12-04T09:41:44.2717944Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:44.2718043Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2718129Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2718235Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2718757Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2719231Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2719770Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2720284Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2720750Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2721272Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2721741Z triton_mm_3106 0.0348 ms 88.2% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2722220Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2722690Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2723208Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2723537Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:44.2723710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2723812Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2723948Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2724201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2725182Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2725271Z graph_break [] 2025-12-04T09:41:44.2725380Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2725553Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2725652Z Autotune Choices Stats: 2025-12-04T09:41:44.2726485Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2726584Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2726673Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2726783Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2727257Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2727739Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2728252Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2728729Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2729252Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2729728Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2730198Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2730666Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2731132Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2731600Z triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2731937Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.2732068Z Autotune Choices Stats: 2025-12-04T09:41:44.2732917Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2733016Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2733104Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2733213Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2733685Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2734202Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2734681Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2735154Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2735622Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2736093Z triton_mm_3145 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2736561Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2737035Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2737551Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2738087Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2738461Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:44.2738560Z Autotune Choices Stats: 2025-12-04T09:41:44.2739386Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.2739486Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2739574Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2739689Z dtypes: 
torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2740169Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2740641Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2741116Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2741581Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2742117Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2742595Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2743059Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2743576Z triton_mm_3178 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2744047Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2744520Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2744846Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:44.2745019Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2745120Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2745253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2745499Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2746903Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2746991Z graph_break [] 2025-12-04T09:41:44.2747100Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2747273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2747367Z Autotune Choices Stats: 2025-12-04T09:41:44.2748300Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2748396Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2748490Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2748597Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2749081Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2749559Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2750036Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2750527Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2751034Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2751509Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2751977Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2752446Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2752951Z triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2753429Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:44.2753765Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:44.2753858Z Autotune Choices Stats: 2025-12-04T09:41:44.2754711Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2754809Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2754902Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2755011Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2755484Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2755996Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2756472Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2756936Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2757446Z triton_mm_3237 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2757912Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.2758384Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2758854Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2759326Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2759860Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2760236Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:44.2760411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2760507Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2760646Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2760896Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2762302Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2762393Z graph_break [] 2025-12-04T09:41:44.2762501Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2762682Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:44.2762776Z Autotune Choices Stats: 2025-12-04T09:41:44.2763606Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2763705Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2763794Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2763901Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2764378Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2764866Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2765395Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2765866Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2766342Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2766858Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2767324Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2767849Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2768324Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2768793Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2769128Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:44.2769222Z Autotune Choices Stats: 2025-12-04T09:41:44.2770060Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2770194Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2770286Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2770391Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2770876Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2771444Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2771912Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2772383Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2772848Z triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2773316Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2773785Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2774256Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2774784Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2775253Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2775585Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:44.2775826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2775925Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2776060Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2776306Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2777707Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2777804Z graph_break [] 2025-12-04T09:41:44.2777922Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2778097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2778189Z Autotune Choices Stats: 2025-12-04T09:41:44.2779036Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.2779170Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2779258Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2779369Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2779850Z triton_mm_3300 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2780336Z triton_mm_3288 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2780843Z triton_mm_3289 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2781322Z 
triton_mm_3294 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2781799Z triton_mm_3295 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2782269Z triton_mm_3296 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2782737Z triton_mm_3287 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2783204Z triton_mm_3290 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2783672Z triton_mm_3291 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2784181Z triton_mm_3292 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2784517Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6016 seconds precompiling for 15 choices 2025-12-04T09:41:44.2784611Z Autotune Choices Stats: 2025-12-04T09:41:44.2785445Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028575999662280083, "best_triton_pos": 0} 2025-12-04T09:41:44.2785583Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.2785670Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2785774Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2786258Z triton_mm_3318 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2786726Z triton_mm_3319 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2787199Z triton_mm_3320 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2787670Z triton_mm_3323 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2788136Z triton_mm_3317 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2788642Z triton_mm_3322 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2789109Z triton_mm_3321 0.0337 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2789584Z triton_mm_3324 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2790093Z triton_mm_3326 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2790570Z triton_mm_3327 0.0348 ms 82.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2790897Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8359 seconds precompiling for 13 choices 2025-12-04T09:41:44.2791075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2791173Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2791305Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2791560Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2792981Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2793072Z graph_break [] 2025-12-04T09:41:44.2793178Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2793354Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2793497Z Autotune Choices Stats: 2025-12-04T09:41:44.2794344Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3344", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.2794481Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2794574Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.2794682Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2795169Z triton_mm_3344 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2795682Z triton_mm_3332 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2796154Z triton_mm_3333 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2796632Z triton_mm_3339 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2797112Z triton_mm_3341 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2797591Z triton_mm_3331 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2798320Z triton_mm_3338 0.0285 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2798964Z triton_mm_3330 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2799681Z triton_mm_3334 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2800697Z triton_mm_3335 
0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2801186Z SingleProcess AUTOTUNE benchmarking takes 0.2046 seconds and 0.6449 seconds precompiling for 15 choices 2025-12-04T09:41:44.2801318Z Autotune Choices Stats: 2025-12-04T09:41:44.2802462Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.2802598Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2802717Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2803539Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2804269Z triton_mm_3366 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2804879Z triton_mm_3360 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2805357Z triton_mm_3361 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2805965Z triton_mm_3362 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2806435Z triton_mm_3363 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2806967Z triton_mm_3365 0.0328 ms 84.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2807493Z triton_mm_3367 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2807961Z triton_mm_3364 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2808444Z triton_mm_3369 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2808915Z triton_mm_3370 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2809265Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.9040 seconds precompiling for 13 choices 2025-12-04T09:41:44.2809448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2809545Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2809682Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2809985Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2810927Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2811018Z graph_break [] 2025-12-04T09:41:44.2811123Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2811301Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:44.2811399Z Autotune Choices Stats: 2025-12-04T09:41:44.2812293Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3382", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.2812396Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2812485Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2812597Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2813077Z triton_mm_3382 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2813544Z triton_mm_3374 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2814074Z triton_mm_3379 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2814549Z triton_mm_3380 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2815027Z triton_mm_3381 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2815545Z triton_mm_3384 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2816018Z triton_mm_3373 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2816527Z triton_mm_3375 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2816995Z triton_mm_3376 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2817467Z triton_mm_3377 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2817798Z SingleProcess AUTOTUNE benchmarking takes 0.2041 seconds and 0.6036 seconds precompiling for 15 choices 2025-12-04T09:41:44.2817896Z Autotune Choices Stats: 2025-12-04T09:41:44.2818721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3406", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.2818821Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2818911Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2819018Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2819539Z triton_mm_3406 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2820017Z triton_mm_3404 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2820482Z triton_mm_3405 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2820991Z triton_mm_3409 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2821457Z triton_mm_3403 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2821926Z triton_mm_3408 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2822392Z triton_mm_3407 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2822874Z triton_mm_3410 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2823348Z triton_mm_3412 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2823819Z triton_mm_3413 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2824152Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8691 seconds precompiling for 13 choices 2025-12-04T09:41:44.2824247Z Autotune Choices Stats: 2025-12-04T09:41:44.2825122Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3437", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 
0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.2825218Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2825343Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2825464Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2825944Z triton_mm_3437 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2826466Z triton_mm_3438 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2826949Z triton_mm_3440 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2827423Z triton_mm_3442 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2827953Z triton_mm_3443 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2828428Z triton_mm_3430 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2828940Z triton_mm_3435 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2829406Z triton_mm_3429 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2829876Z triton_mm_3433 0.0287 ms 96.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2830388Z triton_mm_3436 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2830722Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6051 seconds precompiling for 15 choices 2025-12-04T09:41:44.2830903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2831004Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2831143Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2831392Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2832333Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2832423Z graph_break [] 2025-12-04T09:41:44.2832530Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2832709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2832804Z Autotune Choices Stats: 2025-12-04T09:41:44.2833633Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3462", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2833776Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2833865Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2833973Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2834451Z 
triton_mm_3462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2834981Z triton_mm_3466 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2835459Z triton_mm_3467 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2835942Z triton_mm_3472 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2836414Z triton_mm_3459 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2836881Z triton_mm_3460 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2837393Z triton_mm_3461 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2837872Z triton_mm_3463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2838402Z triton_mm_3465 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2838879Z triton_mm_3468 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2839207Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.5791 seconds precompiling for 15 choices 2025-12-04T09:41:44.2839307Z Autotune Choices Stats: 2025-12-04T09:41:44.2840319Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3489", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.2840418Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2840510Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2840619Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2841105Z triton_mm_3489 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2841600Z triton_mm_3490 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2842073Z triton_mm_3491 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2842549Z triton_mm_3492 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2843025Z triton_mm_3495 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2843529Z triton_mm_3494 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2844010Z triton_mm_3493 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2844532Z triton_mm_3498 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2845008Z triton_mm_3496 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2845486Z triton_mm_3499 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2845822Z SingleProcess AUTOTUNE benchmarking takes 0.1783 seconds and 1.8202 seconds precompiling for 13 choices 2025-12-04T09:41:44.2845916Z Autotune Choices Stats: 2025-12-04T09:41:44.2846740Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3516", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2846845Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2846934Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2847051Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2847534Z triton_mm_3516 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2848048Z triton_mm_3517 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.2848521Z triton_mm_3520 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2848998Z triton_mm_3522 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2849516Z triton_mm_3524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2849996Z triton_mm_3525 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2850465Z triton_mm_3515 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2850934Z triton_mm_3518 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2851402Z triton_mm_3519 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2851871Z triton_mm_3521 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2852203Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.2852384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2852524Z frames [('total', 1), 
('ok', 1)] 2025-12-04T09:41:44.2852662Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2852907Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2853846Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2853977Z graph_break [] 2025-12-04T09:41:44.2854083Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2854261Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2854364Z Autotune Choices Stats: 2025-12-04T09:41:44.2855196Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3547", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2855290Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2855382Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2855488Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2855972Z triton_mm_3547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2856450Z triton_mm_3552 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2856968Z triton_mm_3556 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2857442Z 
triton_mm_3548 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2857963Z triton_mm_3546 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2858533Z triton_mm_3555 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2859000Z triton_mm_3549 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2859471Z triton_mm_3550 0.0297 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2859935Z triton_mm_3545 0.0306 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2860409Z triton_mm_3551 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2860750Z SingleProcess AUTOTUNE benchmarking takes 0.2059 seconds and 0.6126 seconds precompiling for 15 choices 2025-12-04T09:41:44.2860847Z Autotune Choices Stats: 2025-12-04T09:41:44.2861669Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3577", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.2861765Z AUTOTUNE 
mm(256x256, 256x256) 2025-12-04T09:41:44.2861893Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2862004Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2862479Z triton_mm_3577 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2862952Z triton_mm_3576 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2863466Z triton_mm_3578 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2863938Z triton_mm_3581 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2864407Z triton_mm_3575 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2864870Z triton_mm_3580 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2865344Z triton_mm_3579 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2865819Z triton_mm_3584 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2866335Z triton_mm_3585 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2866811Z triton_mm_3582 0.0348 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2867144Z SingleProcess AUTOTUNE benchmarking takes 0.1807 seconds and 1.8175 seconds precompiling for 13 choices 2025-12-04T09:41:44.2867238Z Autotune Choices Stats: 2025-12-04T09:41:44.2868156Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3603", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2868258Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2868350Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2868466Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2868943Z triton_mm_3603 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2869414Z triton_mm_3607 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2869897Z triton_mm_3612 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2870373Z triton_mm_3611 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2870839Z triton_mm_3606 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:44.2871347Z triton_mm_3605 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2871819Z triton_mm_3608 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2872288Z triton_mm_3601 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2872832Z triton_mm_3602 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2873302Z triton_mm_3604 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2873632Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.2873811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2873909Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2874040Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2874294Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2875669Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2875801Z graph_break [] 2025-12-04T09:41:44.2875905Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2876085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2876185Z Autotune Choices Stats: 2025-12-04T09:41:44.2877018Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3638", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2877122Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2877255Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2877363Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2877852Z triton_mm_3638 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2878328Z triton_mm_3639 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2878803Z triton_mm_3642 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2879270Z triton_mm_3634 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2879838Z triton_mm_3636 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2880322Z triton_mm_3644 0.0286 ms 96.5% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2880843Z triton_mm_3631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2881313Z triton_mm_3633 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2881776Z triton_mm_3637 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2882291Z triton_mm_3640 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2882627Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6077 seconds precompiling for 15 choices 2025-12-04T09:41:44.2882722Z Autotune Choices Stats: 2025-12-04T09:41:44.2883559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3667", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.2883654Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2883741Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2883854Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2884328Z triton_mm_3667 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2884797Z triton_mm_3664 0.0286 ms 97.2% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2885312Z triton_mm_3662 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2885781Z triton_mm_3663 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2886246Z triton_mm_3661 0.0297 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2886751Z triton_mm_3666 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2887222Z triton_mm_3665 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2887737Z triton_mm_3670 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2888222Z triton_mm_3671 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2888692Z triton_mm_3668 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2889029Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8616 seconds precompiling for 13 choices 2025-12-04T09:41:44.2889203Z ----------------------------- 
Captured stdout call ----------------------------- 2025-12-04T09:41:44.2889301Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2889437Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2889682Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2891086Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2891213Z graph_break [] 2025-12-04T09:41:44.2891320Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2891497Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2891590Z Autotune Choices Stats: 2025-12-04T09:41:44.2892420Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3676", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2892518Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2892606Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2892717Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2893192Z triton_mm_3676 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2893663Z triton_mm_3678 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.2894146Z triton_mm_3682 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2894674Z triton_mm_3688 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2895143Z triton_mm_3674 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2895613Z triton_mm_3675 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2896125Z triton_mm_3677 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2896597Z triton_mm_3679 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2897065Z triton_mm_3680 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2897591Z triton_mm_3681 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2897919Z SingleProcess AUTOTUNE benchmarking takes 0.2086 seconds and 0.6227 seconds precompiling for 15 choices 2025-12-04T09:41:44.2898017Z Autotune Choices Stats: 2025-12-04T09:41:44.2898857Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3705", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027744000777602196, "best_triton_pos": 0} 2025-12-04T09:41:44.2898953Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2899047Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2899151Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2899668Z triton_mm_3705 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2900135Z triton_mm_3707 0.0286 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2900919Z triton_mm_3704 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2901400Z triton_mm_3706 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2901871Z triton_mm_3710 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2902343Z triton_mm_3709 0.0307 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2902807Z triton_mm_3708 0.0328 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2903288Z triton_mm_3714 0.0347 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2903759Z triton_mm_3711 0.0348 ms 79.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2904318Z triton_mm_3713 0.0348 ms 79.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2904652Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8239 seconds precompiling for 13 choices 2025-12-04T09:41:44.2904874Z ______________ TestPatternMatcher.test_mixed_mm_exhaustive_dtypes ______________ 2025-12-04T09:41:44.2904981Z Traceback (most recent call last): 2025-12-04T09:41:44.2905398Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 395, in test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.2905645Z self._test_mixed_impl(fn, args, True, False, rtol=0.16, atol=1e-4) 2025-12-04T09:41:44.2906005Z File "/var/lib/jenkins/workspace/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:41:44.2906192Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:41:44.2906359Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.2906444Z Searched string: 2025-12-04T09:41:44.2906577Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.2906586Z 2025-12-04T09:41:44.2906707Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.2906712Z 2025-12-04T09:41:44.2906841Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.2906965Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.2906981Z 2025-12-04T09:41:44.2907098Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.2907199Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.2907312Z xindex = idx_n + 256*idx_m 
2025-12-04T09:41:44.2907406Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.2907411Z 2025-12-04T09:41:44.2907498Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.2907592Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.2907688Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.2907779Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.2907783Z 2025-12-04T09:41:44.2907790Z 2025-12-04T09:41:44.2908031Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.2908036Z 2025-12-04T09:41:44.2908040Z 2025-12-04T09:41:44.2908168Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.2908291Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.2908404Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.2908491Z idx_m = rm[:, None] 2025-12-04T09:41:44.2908634Z idx_n = rn[None, :] 2025-12-04T09:41:44.2908729Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.2908734Z 2025-12-04T09:41:44.2908839Z # inductor generates a suffix 2025-12-04T09:41:44.2908930Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.2909141Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.2909235Z ''', device_str='cuda') 2025-12-04T09:41:44.2909239Z 2025-12-04T09:41:44.2909243Z 2025-12-04T09:41:44.2909343Z async_compile.wait(globals()) 2025-12-04T09:41:44.2909426Z del async_compile 2025-12-04T09:41:44.2909430Z 2025-12-04T09:41:44.2909514Z class Runner: 2025-12-04T09:41:44.2909619Z def __init__(self, partitions): 2025-12-04T09:41:44.2909726Z self.partitions = partitions 2025-12-04T09:41:44.2909731Z 2025-12-04T09:41:44.2909844Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.2909936Z new_callables = [] 2025-12-04T09:41:44.2910059Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.2910167Z new_callables.append(fn(c)) 2025-12-04T09:41:44.2910269Z self.partitions = new_callables 2025-12-04T09:41:44.2910275Z 2025-12-04T09:41:44.2910371Z def call(self, args): 2025-12-04T09:41:44.2910461Z arg0_1, 
arg1_1 = args 2025-12-04T09:41:44.2910544Z args.clear() 2025-12-04T09:41:44.2910681Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.2910853Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.2910966Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.2911063Z torch.cuda.set_device(0) 2025-12-04T09:41:44.2911234Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.2911459Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.2911560Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.2911750Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.2911842Z del arg0_1 2025-12-04T09:41:44.2912045Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.2912310Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.2912410Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.2912635Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.2912720Z del arg1_1 2025-12-04T09:41:44.2912800Z del buf0 2025-12-04T09:41:44.2912885Z return (buf1, ) 2025-12-04T09:41:44.2912891Z 2025-12-04T09:41:44.2912997Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.2913082Z call = runner.call 2025-12-04T09:41:44.2913244Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.2913253Z 2025-12-04T09:41:44.2913257Z 2025-12-04T09:41:44.2913395Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.2913531Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.2913682Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.2913887Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.2914090Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.2914198Z fn = lambda: 
call([arg0_1, arg1_1]) 2025-12-04T09:41:44.2914360Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.2914365Z 2025-12-04T09:41:44.2914369Z 2025-12-04T09:41:44.2914465Z if __name__ == "__main__": 2025-12-04T09:41:44.2914710Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.2914871Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.2914959Z From CHECK: .to( 2025-12-04T09:41:44.2914964Z 2025-12-04T09:41:44.2914967Z 2025-12-04T09:41:44.2915139Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.2915743Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.2915748Z 2025-12-04T09:41:44.2915966Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.2916146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2916247Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2916381Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2916632Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2918061Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2918147Z graph_break [] 2025-12-04T09:41:44.2918254Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2918430Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:41:44.2918570Z Autotune Choices Stats: 2025-12-04T09:41:44.2919412Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_28", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2919587Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2919680Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2919789Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2920276Z triton_mm_28 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2920783Z triton_mm_15 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2921265Z triton_mm_16 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2921728Z triton_mm_17 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2922184Z triton_mm_20 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2922663Z triton_mm_29 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2923117Z triton_mm_18 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2923576Z triton_mm_19 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2924073Z triton_mm_21 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2924538Z triton_mm_22 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2924917Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.2925015Z Autotune Choices Stats: 2025-12-04T09:41:44.2925864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_46", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2925962Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2926048Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2926163Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2926636Z triton_mm_46 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2927094Z triton_mm_48 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2927558Z triton_mm_45 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, 
num_warps=2 2025-12-04T09:41:44.2928063Z triton_mm_47 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2928566Z triton_mm_50 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2929021Z triton_mm_51 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2929484Z triton_mm_49 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2929990Z triton_mm_54 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2930455Z triton_mm_55 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2930926Z triton_mm_52 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2931259Z SingleProcess AUTOTUNE benchmarking takes 0.1787 seconds and 1.7847 seconds precompiling for 13 choices 2025-12-04T09:41:44.2931437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2931539Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2931673Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2931926Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:41:44.2932867Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2932959Z graph_break [] 2025-12-04T09:41:44.2933108Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2933285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2933384Z Autotune Choices Stats: 2025-12-04T09:41:44.2934206Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2934345Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2934434Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2934542Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2935020Z triton_mm_797 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2935493Z triton_mm_798 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2935972Z triton_mm_799 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2936450Z triton_mm_801 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2936935Z triton_mm_804 
0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2937510Z triton_mm_806 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2937975Z triton_mm_793 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2938448Z triton_mm_794 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2938984Z triton_mm_803 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2939451Z triton_mm_795 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2939785Z SingleProcess AUTOTUNE benchmarking takes 0.2012 seconds and 0.6194 seconds precompiling for 15 choices 2025-12-04T09:41:44.2939881Z Autotune Choices Stats: 2025-12-04T09:41:44.2940706Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_825", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2940803Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2940898Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2941006Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2941480Z triton_mm_825 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2941954Z triton_mm_826 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2942465Z triton_mm_824 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2942933Z triton_mm_823 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2943392Z triton_mm_829 0.0298 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2943897Z triton_mm_828 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2944361Z triton_mm_827 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2944833Z triton_mm_830 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2945305Z triton_mm_832 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2945771Z triton_mm_833 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.2946112Z SingleProcess AUTOTUNE benchmarking takes 0.1776 seconds and 1.7676 seconds precompiling for 13 choices 2025-12-04T09:41:44.2946204Z Autotune Choices Stats: 2025-12-04T09:41:44.2947050Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.2947215Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2947310Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2947444Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2947922Z triton_mm_858 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2948426Z triton_mm_852 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2948893Z triton_mm_854 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2949363Z triton_mm_856 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2949840Z triton_mm_859 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2950308Z triton_mm_861 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 
2025-12-04T09:41:44.2950778Z triton_mm_860 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2951250Z triton_mm_862 0.0277 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2951759Z triton_mm_857 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2952226Z triton_mm_849 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2952554Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6076 seconds precompiling for 15 choices 2025-12-04T09:41:44.2952771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2952874Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2953008Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2953257Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2954198Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2954285Z graph_break [] 2025-12-04T09:41:44.2954391Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2954565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2954661Z Autotune Choices Stats: 2025-12-04T09:41:44.2955490Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_880", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.2955592Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2955722Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2955827Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2956311Z triton_mm_880 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2956773Z triton_mm_879 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2957257Z triton_mm_882 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2957788Z triton_mm_884 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2958263Z triton_mm_886 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2958743Z triton_mm_887 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2959211Z triton_mm_888 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2959729Z triton_mm_889 0.0276 ms 99.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2960245Z triton_mm_881 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2960711Z triton_mm_883 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2961087Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.5951 seconds precompiling for 15 choices 2025-12-04T09:41:44.2961182Z Autotune Choices Stats: 2025-12-04T09:41:44.2962017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_912", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027488000690937042, "best_triton_pos": 0} 2025-12-04T09:41:44.2962153Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2962244Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2962354Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2962829Z triton_mm_912 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2963303Z triton_mm_910 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2963763Z triton_mm_911 0.0276 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2964226Z triton_mm_909 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2964690Z triton_mm_915 0.0287 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2970260Z triton_mm_914 0.0307 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2971501Z triton_mm_913 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2971978Z triton_mm_917 0.0328 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2972452Z triton_mm_916 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2972964Z triton_mm_918 0.0338 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2973301Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.7971 seconds precompiling for 13 choices 2025-12-04T09:41:44.2973396Z Autotune Choices Stats: 2025-12-04T09:41:44.2974253Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_943", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2974346Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2974430Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.2974544Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.2975026Z triton_mm_943 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2975500Z triton_mm_946 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2975985Z triton_mm_949 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2976489Z triton_mm_936 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2976953Z triton_mm_937 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2977504Z triton_mm_938 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2977967Z triton_mm_939 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2978426Z triton_mm_940 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2978887Z triton_mm_941 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2979356Z triton_mm_942 0.0276 
ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2979687Z SingleProcess AUTOTUNE benchmarking takes 0.1996 seconds and 0.6088 seconds precompiling for 15 choices 2025-12-04T09:41:44.2979872Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2979965Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2980096Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2980413Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2981798Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2981887Z graph_break [] 2025-12-04T09:41:44.2981992Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2982207Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2982304Z Autotune Choices Stats: 2025-12-04T09:41:44.2983153Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_979", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.2983255Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2983342Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2983447Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2983931Z triton_mm_979 0.0266 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2984394Z triton_mm_965 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2984856Z triton_mm_968 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2985319Z triton_mm_970 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2985822Z triton_mm_971 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2986289Z triton_mm_972 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2986798Z triton_mm_973 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2987271Z triton_mm_976 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2987760Z triton_mm_966 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2988253Z triton_mm_967 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2988583Z SingleProcess AUTOTUNE benchmarking takes 0.1992 seconds and 0.6124 seconds precompiling for 15 choices 2025-12-04T09:41:44.2988675Z Autotune Choices Stats: 2025-12-04T09:41:44.2989508Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.2989602Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2989739Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2989844Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.2990328Z triton_mm_996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2990798Z triton_mm_997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2991269Z triton_mm_1001 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.2991776Z triton_mm_998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.2992241Z triton_mm_995 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.2992711Z triton_mm_999 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.2993173Z triton_mm_1000 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2993643Z triton_mm_1004 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2994127Z triton_mm_1005 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2994597Z triton_mm_1002 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.2994971Z SingleProcess AUTOTUNE benchmarking takes 0.1774 seconds and 1.8542 seconds precompiling for 13 choices 2025-12-04T09:41:44.2995145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.2995238Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.2995372Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.2995617Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.2996607Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.2996691Z graph_break [] 2025-12-04T09:41:44.2996795Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.2996971Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.2997061Z Autotune Choices Stats: 2025-12-04T09:41:44.2997898Z 
{"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.2997992Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.2998077Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.2998185Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.2998664Z triton_mm_1016 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.2999178Z triton_mm_1009 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.2999719Z triton_mm_1010 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3000193Z triton_mm_1011 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3001106Z triton_mm_1012 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3001588Z triton_mm_1013 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3002070Z triton_mm_1014 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3002542Z triton_mm_1017 
0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3003017Z triton_mm_1018 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3003490Z triton_mm_1019 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3003817Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6098 seconds precompiling for 15 choices 2025-12-04T09:41:44.3003918Z Autotune Choices Stats: 2025-12-04T09:41:44.3004807Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1040", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3004911Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3004995Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3005096Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3005572Z triton_mm_1040 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3006109Z triton_mm_1041 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3006583Z triton_mm_1039 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3007055Z triton_mm_1044 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3007565Z triton_mm_1038 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3008098Z triton_mm_1043 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3008568Z triton_mm_1042 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3009108Z triton_mm_1045 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3009580Z triton_mm_1048 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3010057Z triton_mm_1047 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3010384Z SingleProcess AUTOTUNE benchmarking takes 0.1765 seconds and 1.8296 seconds precompiling for 13 choices 2025-12-04T09:41:44.3010475Z Autotune Choices Stats: 2025-12-04T09:41:44.3011363Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1074", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3011459Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.3011548Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3011662Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3012139Z triton_mm_1074 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3012616Z triton_mm_1075 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3013084Z triton_mm_1067 0.0268 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3013548Z triton_mm_1064 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3014064Z triton_mm_1065 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3014532Z triton_mm_1066 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3014996Z triton_mm_1068 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3015532Z triton_mm_1071 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3016003Z triton_mm_1072 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=4, num_warps=8 2025-12-04T09:41:44.3016475Z triton_mm_1073 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3016804Z SingleProcess AUTOTUNE benchmarking takes 0.2009 seconds and 0.6104 seconds precompiling for 15 choices 2025-12-04T09:41:44.3016978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3017075Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3017231Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3017500Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3018444Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3018570Z graph_break [] 2025-12-04T09:41:44.3018677Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3018859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3018950Z Autotune Choices Stats: 2025-12-04T09:41:44.3019777Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1094", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3019913Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3019999Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3020109Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3020584Z triton_mm_1094 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3021058Z triton_mm_1096 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3021527Z triton_mm_1097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3022002Z triton_mm_1102 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3022482Z triton_mm_1103 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3022957Z triton_mm_1105 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3023472Z triton_mm_1101 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3023953Z triton_mm_1108 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3024455Z triton_mm_1099 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3024930Z triton_mm_1095 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.3025263Z SingleProcess AUTOTUNE benchmarking takes 0.2001 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.3025364Z Autotune Choices Stats: 2025-12-04T09:41:44.3026192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1124", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3026290Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3026376Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3026478Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3026955Z triton_mm_1124 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3027430Z triton_mm_1125 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3027947Z triton_mm_1126 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3028414Z triton_mm_1127 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3028881Z triton_mm_1130 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3029395Z triton_mm_1129 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3029860Z 
triton_mm_1128 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3030339Z triton_mm_1133 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3030807Z triton_mm_1134 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3031282Z triton_mm_1131 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3031613Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8198 seconds precompiling for 13 choices 2025-12-04T09:41:44.3031705Z Autotune Choices Stats: 2025-12-04T09:41:44.3032586Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1150", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3032679Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3032772Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3032881Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3033350Z triton_mm_1150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3033864Z triton_mm_1154 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3034333Z triton_mm_1156 0.0276 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3034818Z triton_mm_1158 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3035297Z triton_mm_1159 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3035771Z triton_mm_1160 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3036258Z triton_mm_1162 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3036724Z triton_mm_1152 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3037294Z triton_mm_1151 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3037775Z triton_mm_1153 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3038106Z SingleProcess AUTOTUNE benchmarking takes 0.4616 seconds and 0.6292 seconds precompiling for 15 choices 2025-12-04T09:41:44.3038280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3038411Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3038547Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T09:41:44.3038791Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3040245Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3040329Z graph_break [] 2025-12-04T09:41:44.3040434Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3040619Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3040709Z Autotune Choices Stats: 2025-12-04T09:41:44.3041549Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1182", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3041645Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3041733Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3041888Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3042367Z triton_mm_1182 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3042889Z triton_mm_1183 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3043414Z triton_mm_1185 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3043893Z triton_mm_1188 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3044380Z triton_mm_1189 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3044855Z triton_mm_1190 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3045337Z triton_mm_1191 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3045824Z triton_mm_1192 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3046304Z triton_mm_1193 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3046821Z triton_mm_1181 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3047164Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.5985 seconds precompiling for 15 choices 2025-12-04T09:41:44.3047274Z Autotune Choices Stats: 2025-12-04T09:41:44.3048171Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1211", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", 
"best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3048270Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3048355Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3048462Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3048946Z triton_mm_1211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3049420Z triton_mm_1212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3049890Z triton_mm_1210 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3050371Z triton_mm_1213 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3050841Z triton_mm_1216 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3051378Z triton_mm_1215 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3051843Z triton_mm_1214 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3052320Z triton_mm_1217 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3052835Z triton_mm_1219 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3053320Z triton_mm_1218 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3053653Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8586 seconds precompiling for 13 choices 2025-12-04T09:41:44.3053828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3053926Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3054056Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3054299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3055680Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3055802Z graph_break [] 2025-12-04T09:41:44.3055910Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3056086Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3056177Z Autotune Choices Stats: 2025-12-04T09:41:44.3057028Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3057124Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.3057256Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3057366Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3057852Z triton_mm_1237 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3058327Z triton_mm_1224 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3058795Z triton_mm_1225 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3059267Z triton_mm_1226 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3059735Z triton_mm_1228 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3060203Z triton_mm_1229 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3060717Z triton_mm_1231 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3061183Z triton_mm_1227 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3061659Z triton_mm_1230 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.3062170Z triton_mm_1232 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3062504Z SingleProcess AUTOTUNE benchmarking takes 0.2110 seconds and 0.7239 seconds precompiling for 15 choices 2025-12-04T09:41:44.3062599Z Autotune Choices Stats: 2025-12-04T09:41:44.3063451Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1254", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3063545Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3063630Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3063735Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3064222Z triton_mm_1254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3064696Z triton_mm_1255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3065217Z triton_mm_1256 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3065689Z triton_mm_1259 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3066160Z triton_mm_1253 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 
2025-12-04T09:41:44.3066667Z triton_mm_1258 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3067141Z triton_mm_1257 0.0327 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3067614Z triton_mm_1263 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3068135Z triton_mm_1262 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3068608Z triton_mm_1260 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3068942Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8818 seconds precompiling for 13 choices 2025-12-04T09:41:44.3069120Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3069213Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3069346Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3069592Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3071015Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3071137Z graph_break [] 2025-12-04T09:41:44.3071245Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3071420Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3071516Z Autotune Choices Stats: 2025-12-04T09:41:44.3072349Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1266", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.3072451Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3072535Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3072639Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3073119Z triton_mm_1266 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3073596Z triton_mm_1274 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3074071Z triton_mm_1275 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3074585Z triton_mm_1276 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3075062Z triton_mm_1277 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3075540Z triton_mm_1279 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3076056Z triton_mm_1280 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3076527Z triton_mm_1268 0.0286 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3077002Z triton_mm_1269 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3077470Z triton_mm_1270 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3077812Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6230 seconds precompiling for 15 choices 2025-12-04T09:41:44.3077915Z Autotune Choices Stats: 2025-12-04T09:41:44.3078785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1297", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3078879Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3078968Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3079071Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3079647Z triton_mm_1297 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3080123Z triton_mm_1298 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3080635Z triton_mm_1299 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3081112Z triton_mm_1302 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3081580Z triton_mm_1301 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3082055Z triton_mm_1296 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3082522Z triton_mm_1300 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3082996Z triton_mm_1306 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3083470Z triton_mm_1303 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3083985Z triton_mm_1305 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3084320Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8759 seconds precompiling for 13 choices 2025-12-04T09:41:44.3084492Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:41:44.3084585Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3084718Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3084964Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3086414Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3086501Z graph_break [] 2025-12-04T09:41:44.3086608Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3086783Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3086873Z Autotune Choices Stats: 2025-12-04T09:41:44.3087719Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3087814Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3087898Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3088007Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3088484Z triton_mm_1309 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3089000Z triton_mm_1311 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.3089473Z triton_mm_1315 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3089989Z triton_mm_1316 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3090528Z triton_mm_1319 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3091005Z triton_mm_1320 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3091492Z triton_mm_1323 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3091962Z triton_mm_1317 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3092442Z triton_mm_1310 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3092906Z triton_mm_1312 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3093279Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.5944 seconds precompiling for 15 choices 2025-12-04T09:41:44.3093374Z Autotune Choices Stats: 2025-12-04T09:41:44.3094205Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1345", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.3094302Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3094389Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3094490Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3095008Z triton_mm_1345 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3095480Z triton_mm_1342 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3095956Z triton_mm_1341 0.0277 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3096439Z triton_mm_1340 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3096909Z triton_mm_1339 0.0297 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3097414Z triton_mm_1344 0.0307 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3097894Z triton_mm_1343 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3098409Z triton_mm_1346 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3098883Z triton_mm_1348 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3099401Z triton_mm_1349 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3099731Z SingleProcess AUTOTUNE benchmarking takes 0.1755 seconds and 1.7927 seconds precompiling for 13 choices 2025-12-04T09:41:44.3099900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3100000Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3100131Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3100662Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3102040Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3102130Z graph_break [] 2025-12-04T09:41:44.3102232Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3102404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3102582Z Autotune Choices Stats: 2025-12-04T09:41:44.3103437Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3103529Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3103621Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3103726Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3104211Z triton_mm_1366 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3104740Z triton_mm_1353 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3105209Z triton_mm_1355 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3105682Z triton_mm_1356 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3106145Z triton_mm_1357 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3106612Z triton_mm_1358 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3107082Z triton_mm_1359 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3107558Z triton_mm_1360 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3108136Z 
triton_mm_1361 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3108607Z triton_mm_1363 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3108997Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6046 seconds precompiling for 15 choices 2025-12-04T09:41:44.3109091Z Autotune Choices Stats: 2025-12-04T09:41:44.3109920Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3110019Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3110104Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3110214Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3110686Z triton_mm_1383 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3111156Z triton_mm_1384 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3111632Z triton_mm_1388 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3112094Z triton_mm_1382 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3112606Z triton_mm_1385 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3113068Z triton_mm_1387 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3113536Z triton_mm_1386 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3114045Z triton_mm_1391 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3114517Z triton_mm_1392 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3114989Z triton_mm_1389 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3115317Z SingleProcess AUTOTUNE benchmarking takes 0.1758 seconds and 1.7919 seconds precompiling for 13 choices 2025-12-04T09:41:44.3115493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3115588Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3115718Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3115966Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3116900Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3116991Z graph_break [] 2025-12-04T09:41:44.3117133Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3117305Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3117399Z Autotune Choices Stats: 2025-12-04T09:41:44.3118223Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.3118365Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3118449Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3118551Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3119031Z triton_mm_1401 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3119544Z triton_mm_1397 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3120011Z triton_mm_1398 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3120479Z triton_mm_1400 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3120948Z triton_mm_1403 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3121493Z triton_mm_1404 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3121965Z triton_mm_1406 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3122501Z triton_mm_1408 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3123020Z triton_mm_1409 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3123492Z triton_mm_1402 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3123821Z SingleProcess AUTOTUNE benchmarking takes 0.1999 seconds and 0.6096 seconds precompiling for 15 choices 2025-12-04T09:41:44.3123911Z Autotune Choices Stats: 2025-12-04T09:41:44.3124747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1428", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3124837Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3124927Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3125028Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3125501Z triton_mm_1428 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3125975Z triton_mm_1426 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, 
BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3126482Z triton_mm_1431 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3126947Z triton_mm_1427 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3127409Z triton_mm_1425 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3127965Z triton_mm_1430 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3128432Z triton_mm_1429 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3128901Z triton_mm_1432 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3129374Z triton_mm_1434 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3129843Z triton_mm_1435 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3130177Z SingleProcess AUTOTUNE benchmarking takes 0.1762 seconds and 1.8307 seconds precompiling for 13 choices 2025-12-04T09:41:44.3130267Z Autotune Choices Stats: 2025-12-04T09:41:44.3131144Z {"num_choices": 15, 
"num_triton_choices": 15, "best_kernel": "triton_mm_1452", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3131245Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3131330Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3131443Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3131922Z triton_mm_1452 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3132432Z triton_mm_1454 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3132910Z triton_mm_1456 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3133391Z triton_mm_1458 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3133870Z triton_mm_1459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3134343Z triton_mm_1460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3134826Z triton_mm_1461 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3135297Z triton_mm_1462 0.0276 
ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3135817Z triton_mm_1464 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3136299Z triton_mm_1465 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3136626Z SingleProcess AUTOTUNE benchmarking takes 0.1995 seconds and 0.5977 seconds precompiling for 15 choices 2025-12-04T09:41:44.3136854Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3136946Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3137074Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3137330Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3138320Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3138409Z graph_break [] 2025-12-04T09:41:44.3138512Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3138685Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3138780Z Autotune Choices Stats: 2025-12-04T09:41:44.3139607Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 
0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3139746Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3139829Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3139931Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3140410Z triton_mm_1484 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3140889Z triton_mm_1490 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3141366Z triton_mm_1492 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3141887Z triton_mm_1495 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3142356Z triton_mm_1481 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3142829Z triton_mm_1482 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3143293Z triton_mm_1483 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3143766Z triton_mm_1485 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3144235Z triton_mm_1486 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3144701Z triton_mm_1487 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3145076Z SingleProcess AUTOTUNE benchmarking takes 0.2028 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.3145169Z Autotune Choices Stats: 2025-12-04T09:41:44.3146012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3146149Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3146240Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3146342Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3146816Z triton_mm_1514 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3147292Z triton_mm_1512 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3147753Z triton_mm_1513 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3148225Z triton_mm_1517 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3148692Z triton_mm_1511 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3149155Z triton_mm_1516 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3149665Z triton_mm_1515 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3150132Z triton_mm_1518 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3150654Z triton_mm_1520 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3151169Z triton_mm_1521 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3151508Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8620 seconds precompiling for 13 choices 2025-12-04T09:41:44.3151597Z Autotune Choices Stats: 2025-12-04T09:41:44.3152431Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3152526Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3152610Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3152721Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3153197Z triton_mm_1541 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3153665Z triton_mm_1542 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3154182Z triton_mm_1543 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3154658Z triton_mm_1545 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3155138Z triton_mm_1546 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3155690Z triton_mm_1547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3156181Z triton_mm_1550 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3156664Z triton_mm_1551 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3157129Z triton_mm_1539 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3157603Z triton_mm_1540 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3157933Z 
SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6023 seconds precompiling for 15 choices 2025-12-04T09:41:44.3158108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3158242Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3158370Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3158620Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3159615Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3159700Z graph_break [] 2025-12-04T09:41:44.3159805Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3159977Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3160115Z Autotune Choices Stats: 2025-12-04T09:41:44.3160954Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1568", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3161048Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3161134Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3161239Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3161721Z triton_mm_1568 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3162192Z triton_mm_1569 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.3162667Z triton_mm_1573 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3163149Z triton_mm_1577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3163673Z triton_mm_1581 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3164161Z triton_mm_1578 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3164666Z triton_mm_1567 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3165140Z triton_mm_1575 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3165606Z triton_mm_1570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3166071Z triton_mm_1571 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3166404Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6105 seconds precompiling for 15 choices 2025-12-04T09:41:44.3166497Z Autotune Choices Stats: 2025-12-04T09:41:44.3167380Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1599", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.3167481Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3167606Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3167713Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3168188Z triton_mm_1599 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3168660Z triton_mm_1598 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3169122Z triton_mm_1600 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3169627Z triton_mm_1603 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3170098Z triton_mm_1597 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3170565Z triton_mm_1602 0.0307 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3171032Z triton_mm_1601 0.0328 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3171502Z triton_mm_1604 0.0338 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, 
BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3171982Z triton_mm_1606 0.0347 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3172450Z triton_mm_1607 0.0348 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3172817Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8699 seconds precompiling for 13 choices 2025-12-04T09:41:44.3172914Z Autotune Choices Stats: 2025-12-04T09:41:44.3173757Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3173890Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3173973Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3174083Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3174563Z triton_mm_1630 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3175034Z triton_mm_1624 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3175504Z triton_mm_1626 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3175968Z triton_mm_1627 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3176440Z triton_mm_1628 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3176903Z triton_mm_1629 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3177479Z triton_mm_1632 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3177950Z triton_mm_1633 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3178423Z triton_mm_1634 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3178945Z triton_mm_1637 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3179274Z SingleProcess AUTOTUNE benchmarking takes 0.2007 seconds and 0.6323 seconds precompiling for 15 choices 2025-12-04T09:41:44.3179448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3179543Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3179674Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3179924Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3181294Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), 
('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3181382Z graph_break [] 2025-12-04T09:41:44.3181485Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3181659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3181753Z Autotune Choices Stats: 2025-12-04T09:41:44.3182624Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3182719Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3182807Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3182951Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3183432Z triton_mm_1656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3183957Z triton_mm_1658 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3184438Z triton_mm_1663 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3184916Z triton_mm_1662 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3185380Z triton_mm_1659 0.0286 ms 
96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3185851Z triton_mm_1653 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3186321Z triton_mm_1654 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3186839Z triton_mm_1657 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3187342Z triton_mm_1660 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3187832Z triton_mm_1664 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3188209Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6259 seconds precompiling for 15 choices 2025-12-04T09:41:44.3188301Z Autotune Choices Stats: 2025-12-04T09:41:44.3189150Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3189247Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3189334Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3189441Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3189919Z triton_mm_1684 0.0276 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3190392Z triton_mm_1685 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3190856Z triton_mm_1686 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3191321Z triton_mm_1689 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3191854Z triton_mm_1688 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3192328Z triton_mm_1683 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3192845Z triton_mm_1687 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3193317Z triton_mm_1690 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3193795Z triton_mm_1692 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3194266Z triton_mm_1691 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.3194596Z SingleProcess AUTOTUNE benchmarking takes 0.1785 seconds and 1.8989 seconds precompiling for 13 choices 2025-12-04T09:41:44.3194776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3194870Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3195006Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3195253Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3196669Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3196757Z graph_break [] 2025-12-04T09:41:44.3196860Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3197036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3197130Z Autotune Choices Stats: 2025-12-04T09:41:44.3198058Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1707", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.3198158Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3198242Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3198352Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3198834Z triton_mm_1707 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.3199304Z triton_mm_1697 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3199824Z triton_mm_1698 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3200560Z triton_mm_1699 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3201047Z triton_mm_1701 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3201600Z triton_mm_1703 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3202076Z triton_mm_1704 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3202605Z triton_mm_1705 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3203071Z triton_mm_1700 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3203562Z triton_mm_1706 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3203889Z SingleProcess AUTOTUNE benchmarking takes 0.2035 seconds and 0.6139 
seconds precompiling for 15 choices 2025-12-04T09:41:44.3203984Z Autotune Choices Stats: 2025-12-04T09:41:44.3204808Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1732", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3204903Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3204990Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3205091Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3205633Z triton_mm_1732 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3206097Z triton_mm_1728 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3206559Z triton_mm_1729 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3207108Z triton_mm_1727 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3207573Z triton_mm_1726 0.0297 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3208094Z triton_mm_1731 0.0317 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3208557Z triton_mm_1730 0.0328 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3209027Z triton_mm_1733 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3209503Z triton_mm_1736 0.0338 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3209969Z triton_mm_1735 0.0348 ms 76.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3210301Z SingleProcess AUTOTUNE benchmarking takes 0.1761 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.3210474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3210614Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3210745Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3210990Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3212366Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3212495Z graph_break [] 2025-12-04T09:41:44.3212599Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3212771Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3212861Z Autotune Choices Stats: 
2025-12-04T09:41:44.3213709Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1747", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3213804Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3213891Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3213993Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3214472Z triton_mm_1747 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3214988Z triton_mm_1748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3215464Z triton_mm_1749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3215942Z triton_mm_1750 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3216459Z triton_mm_1753 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3216936Z triton_mm_1746 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3217403Z triton_mm_1740 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.3217866Z triton_mm_1744 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3218334Z triton_mm_1741 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3218799Z triton_mm_1742 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3219133Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6267 seconds precompiling for 15 choices 2025-12-04T09:41:44.3219227Z Autotune Choices Stats: 2025-12-04T09:41:44.3220120Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1770", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3220217Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3220302Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3220406Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3220886Z triton_mm_1770 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3221391Z triton_mm_1772 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3221859Z triton_mm_1771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3222326Z 
triton_mm_1769 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3222798Z triton_mm_1775 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3223260Z triton_mm_1774 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3223729Z triton_mm_1773 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3224238Z triton_mm_1776 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3224709Z triton_mm_1778 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3225189Z triton_mm_1779 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3225521Z SingleProcess AUTOTUNE benchmarking takes 0.1791 seconds and 1.9442 seconds precompiling for 13 choices 2025-12-04T09:41:44.3225760Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3225855Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3225986Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3226237Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3227662Z inductor 
[('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3227748Z graph_break [] 2025-12-04T09:41:44.3227852Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3228022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3228118Z Autotune Choices Stats: 2025-12-04T09:41:44.3228999Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1784", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3229097Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3229180Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3229329Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3229814Z triton_mm_1784 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3230291Z triton_mm_1790 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3230813Z triton_mm_1791 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3231294Z triton_mm_1792 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3231777Z triton_mm_1793 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3232263Z triton_mm_1795 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3232729Z triton_mm_1782 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3233205Z triton_mm_1783 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3233712Z triton_mm_1785 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3234225Z triton_mm_1787 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3234556Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6055 seconds precompiling for 15 choices 2025-12-04T09:41:44.3234647Z Autotune Choices Stats: 2025-12-04T09:41:44.3235522Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1813", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3235615Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3235707Z strides: [256, 1], [256, 1] 
2025-12-04T09:41:44.3235809Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3236287Z triton_mm_1813 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3236770Z triton_mm_1814 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3237269Z triton_mm_1815 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3237769Z triton_mm_1818 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3238234Z triton_mm_1812 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3238747Z triton_mm_1817 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3239216Z triton_mm_1816 0.0338 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3239742Z triton_mm_1819 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3240263Z triton_mm_1821 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3240734Z triton_mm_1822 
0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3241065Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.7496 seconds precompiling for 13 choices 2025-12-04T09:41:44.3241236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3241328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3241461Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3241705Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3242654Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3242777Z graph_break [] 2025-12-04T09:41:44.3242881Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3243059Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3243148Z Autotune Choices Stats: 2025-12-04T09:41:44.3243995Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1838", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.3244090Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3244175Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3244280Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3244810Z triton_mm_1838 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.3245281Z triton_mm_1830 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3245752Z triton_mm_1831 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3246225Z triton_mm_1835 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3246712Z triton_mm_1827 0.0278 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3247188Z triton_mm_1839 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3247658Z triton_mm_1825 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3248168Z triton_mm_1826 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3248636Z triton_mm_1828 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3253861Z triton_mm_1832 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3254217Z SingleProcess AUTOTUNE benchmarking takes 0.2048 seconds and 0.6147 
seconds precompiling for 15 choices 2025-12-04T09:41:44.3254322Z Autotune Choices Stats: 2025-12-04T09:41:44.3255166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1855", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3255274Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3255360Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3255469Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3255952Z triton_mm_1855 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3256436Z triton_mm_1856 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3256907Z triton_mm_1857 0.0288 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3257499Z triton_mm_1858 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3257967Z triton_mm_1861 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3258431Z triton_mm_1860 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3258938Z triton_mm_1859 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3259420Z triton_mm_1864 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3259890Z triton_mm_1862 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3260365Z triton_mm_1865 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3260700Z SingleProcess AUTOTUNE benchmarking takes 0.1829 seconds and 1.8425 seconds precompiling for 13 choices 2025-12-04T09:41:44.3260793Z Autotune Choices Stats: 2025-12-04T09:41:44.3261626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3261724Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3261814Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3261927Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3262441Z triton_mm_1881 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3262915Z triton_mm_1884 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3263429Z triton_mm_1885 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3263901Z triton_mm_1887 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3264384Z triton_mm_1889 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3264865Z triton_mm_1890 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3265340Z triton_mm_1892 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3265823Z triton_mm_1894 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3266297Z triton_mm_1888 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3266837Z triton_mm_1883 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3267169Z SingleProcess AUTOTUNE benchmarking takes 0.2040 seconds and 0.6216 seconds precompiling for 15 choices 2025-12-04T09:41:44.3267347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3267446Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3267584Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3267920Z aot_autograd [('total', 1), ('autograd_cache_miss', 
1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3269376Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3269465Z graph_break [] 2025-12-04T09:41:44.3269573Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3269747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3269842Z Autotune Choices Stats: 2025-12-04T09:41:44.3270692Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1925", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02755199931561947, "best_triton_pos": 0} 2025-12-04T09:41:44.3270789Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3270880Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3270992Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3271520Z triton_mm_1925 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3272004Z triton_mm_1921 0.0276 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3272475Z triton_mm_1920 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3273008Z triton_mm_1922 0.0277 ms 
99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3273484Z triton_mm_1924 0.0285 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3273953Z triton_mm_1911 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3274430Z triton_mm_1912 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3274895Z triton_mm_1913 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3275376Z triton_mm_1918 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3275846Z triton_mm_1919 0.0287 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3276220Z SingleProcess AUTOTUNE benchmarking takes 0.2057 seconds and 0.6159 seconds precompiling for 15 choices 2025-12-04T09:41:44.3276314Z Autotune Choices Stats: 2025-12-04T09:41:44.3277165Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1942", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3277264Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.3277353Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3277501Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3277996Z triton_mm_1942 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3278468Z triton_mm_1943 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3278941Z triton_mm_1944 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3279403Z triton_mm_1947 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3279930Z triton_mm_1941 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3280395Z triton_mm_1946 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3280905Z triton_mm_1945 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3281383Z triton_mm_1951 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3281851Z triton_mm_1948 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.3282372Z triton_mm_1949 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3282701Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.7698 seconds precompiling for 13 choices 2025-12-04T09:41:44.3282880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3282976Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3283113Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3283361Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3284728Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3284819Z graph_break [] 2025-12-04T09:41:44.3284928Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3285147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3285242Z Autotune Choices Stats: 2025-12-04T09:41:44.3286080Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1955", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3286174Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3286264Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3286370Z dtypes: torch.float16, 
torch.float16 2025-12-04T09:41:44.3286907Z triton_mm_1955 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3287435Z triton_mm_1963 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3287902Z triton_mm_1960 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3288379Z triton_mm_1962 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3288857Z triton_mm_1966 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3289332Z triton_mm_1954 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3289798Z triton_mm_1956 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3290304Z triton_mm_1957 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3290772Z triton_mm_1958 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3291235Z triton_mm_1959 0.0287 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3291612Z SingleProcess AUTOTUNE benchmarking takes 0.2063 seconds and 0.6187 seconds precompiling for 15 choices 2025-12-04T09:41:44.3291708Z Autotune Choices Stats: 2025-12-04T09:41:44.3292563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_1985", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3292664Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3292754Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3292862Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3293337Z triton_mm_1985 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3293812Z triton_mm_1987 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3294280Z triton_mm_1984 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3294791Z triton_mm_1986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3295257Z triton_mm_1990 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3295718Z triton_mm_1989 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3296223Z triton_mm_1988 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3296694Z triton_mm_1991 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3297186Z triton_mm_1993 0.0347 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3297696Z triton_mm_1992 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3298028Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8429 seconds precompiling for 13 choices 2025-12-04T09:41:44.3298204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3298299Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3298435Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3298680Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3300091Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3300179Z graph_break [] 2025-12-04T09:41:44.3300479Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3300721Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3300934Z Autotune Choices Stats: 2025-12-04T09:41:44.3301779Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_1997", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3301881Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3301967Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3302081Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3302556Z triton_mm_1997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3303030Z triton_mm_2004 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3303509Z triton_mm_2005 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3303982Z triton_mm_2007 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3304524Z triton_mm_2011 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3304987Z triton_mm_1999 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.3305453Z triton_mm_2001 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3305991Z triton_mm_2006 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3306462Z triton_mm_1998 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3306934Z triton_mm_2000 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3307268Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6232 seconds precompiling for 15 choices 2025-12-04T09:41:44.3307360Z Autotune Choices Stats: 2025-12-04T09:41:44.3308192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2028", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3308286Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3308376Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3308479Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3308961Z triton_mm_2028 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3309490Z triton_mm_2033 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3309962Z 
triton_mm_2027 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3310467Z triton_mm_2029 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3310928Z triton_mm_2030 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3311398Z triton_mm_2032 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3311860Z triton_mm_2031 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3312333Z triton_mm_2036 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3312805Z triton_mm_2037 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3313270Z triton_mm_2034 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3313645Z SingleProcess AUTOTUNE benchmarking takes 0.1773 seconds and 1.8353 seconds precompiling for 13 choices 2025-12-04T09:41:44.3313817Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3313916Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3314044Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3314288Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3315693Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3315781Z graph_break [] 2025-12-04T09:41:44.3315885Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3316057Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3316146Z Autotune Choices Stats: 2025-12-04T09:41:44.3317026Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2042", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3317129Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3317237Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3317353Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3317845Z triton_mm_2042 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3318316Z triton_mm_2040 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3318823Z triton_mm_2044 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3319303Z triton_mm_2045 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3319869Z triton_mm_2046 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3320343Z triton_mm_2047 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3320817Z triton_mm_2048 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3321289Z triton_mm_2050 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3321762Z triton_mm_2051 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3322242Z triton_mm_2054 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3322574Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6152 seconds precompiling for 15 choices 2025-12-04T09:41:44.3322709Z Autotune Choices Stats: 2025-12-04T09:41:44.3323547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2071", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3323644Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3323729Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3323835Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3324358Z triton_mm_2071 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3324878Z triton_mm_2072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3325358Z triton_mm_2073 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3325828Z triton_mm_2070 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3326297Z triton_mm_2076 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3326760Z triton_mm_2075 0.0318 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3327259Z triton_mm_2074 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3327757Z triton_mm_2077 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3328266Z triton_mm_2080 0.0348 ms 79.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3328741Z triton_mm_2079 0.0358 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3329108Z SingleProcess AUTOTUNE benchmarking takes 0.1806 seconds and 1.8730 seconds precompiling for 13 choices 2025-12-04T09:41:44.3329283Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3329376Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3329505Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3329763Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3330704Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3330789Z graph_break [] 2025-12-04T09:41:44.3330892Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3331063Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3331160Z Autotune Choices Stats: 2025-12-04T09:41:44.3331992Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2083", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3332130Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3332214Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3332318Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.3332800Z triton_mm_2083 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3333271Z triton_mm_2085 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3333790Z triton_mm_2088 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3334262Z triton_mm_2089 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3334742Z triton_mm_2091 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3335231Z triton_mm_2097 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3335701Z triton_mm_2090 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3336177Z triton_mm_2086 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3336651Z triton_mm_2084 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3337121Z triton_mm_2087 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3337543Z SingleProcess AUTOTUNE benchmarking takes 0.2038 seconds and 0.6089 seconds precompiling for 15 choices 2025-12-04T09:41:44.3337656Z Autotune Choices Stats: 2025-12-04T09:41:44.3338494Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2116", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3338627Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3338716Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3338818Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3339293Z triton_mm_2116 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3339766Z triton_mm_2119 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3340236Z triton_mm_2113 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3340709Z triton_mm_2114 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3341178Z triton_mm_2115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3341690Z triton_mm_2118 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3342161Z triton_mm_2117 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3342631Z triton_mm_2123 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3343147Z triton_mm_2122 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3343618Z triton_mm_2120 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3343948Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8241 seconds precompiling for 13 choices 2025-12-04T09:41:44.3344037Z Autotune Choices Stats: 2025-12-04T09:41:44.3344882Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2142", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3344978Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3345063Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3345178Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3345658Z triton_mm_2142 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3346129Z triton_mm_2143 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3346645Z triton_mm_2144 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3347114Z triton_mm_2145 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3347593Z triton_mm_2146 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3348116Z triton_mm_2147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3348591Z triton_mm_2149 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3349074Z triton_mm_2150 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3349556Z triton_mm_2152 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3350031Z triton_mm_2148 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3350361Z SingleProcess AUTOTUNE benchmarking takes 0.5217 seconds and 0.6015 seconds precompiling for 15 choices 2025-12-04T09:41:44.3350537Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:41:44.3350674Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3350802Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3351056Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3352433Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3352522Z graph_break [] 2025-12-04T09:41:44.3352663Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3352837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3352931Z Autotune Choices Stats: 2025-12-04T09:41:44.3353816Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2174", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3353913Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3353998Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3354100Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3354578Z triton_mm_2174 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3355056Z triton_mm_2170 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3355528Z triton_mm_2175 
0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3356046Z triton_mm_2176 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3356519Z triton_mm_2177 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3356993Z triton_mm_2178 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3357562Z triton_mm_2180 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3358036Z triton_mm_2169 0.0277 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3358511Z triton_mm_2179 0.0278 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3358980Z triton_mm_2171 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3359308Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.6199 seconds precompiling for 15 choices 2025-12-04T09:41:44.3359398Z Autotune Choices Stats: 2025-12-04T09:41:44.3360290Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2200", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3360429Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3360519Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3360621Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3361106Z triton_mm_2200 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3361582Z triton_mm_2201 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3362094Z triton_mm_2202 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3362571Z triton_mm_2205 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3363046Z triton_mm_2199 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3363509Z triton_mm_2204 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3363976Z triton_mm_2203 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3364449Z triton_mm_2206 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3364926Z triton_mm_2208 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3365437Z triton_mm_2209 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3365770Z SingleProcess AUTOTUNE benchmarking takes 0.1803 seconds and 1.8733 seconds precompiling for 13 choices 2025-12-04T09:41:44.3365940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3366120Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3366253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3366499Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3367460Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3367549Z graph_break [] 2025-12-04T09:41:44.3367678Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3367854Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3367946Z Autotune Choices Stats: 2025-12-04T09:41:44.3368772Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2212", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3368877Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.3368961Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3369066Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3369536Z triton_mm_2212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3370068Z triton_mm_2213 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3370540Z triton_mm_2214 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3371013Z triton_mm_2218 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3371564Z triton_mm_2221 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3372043Z triton_mm_2223 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3372528Z triton_mm_2226 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3372994Z triton_mm_2217 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3373476Z triton_mm_2225 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:44.3373943Z triton_mm_2216 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3374271Z SingleProcess AUTOTUNE benchmarking takes 0.2036 seconds and 0.6236 seconds precompiling for 15 choices 2025-12-04T09:41:44.3374365Z Autotune Choices Stats: 2025-12-04T09:41:44.3375231Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2245", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3375326Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3375455Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3375560Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3376037Z triton_mm_2245 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3376503Z triton_mm_2244 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3376971Z triton_mm_2242 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3377495Z triton_mm_2243 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3377960Z triton_mm_2248 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 
2025-12-04T09:41:44.3378432Z triton_mm_2247 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3378897Z triton_mm_2246 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3379415Z triton_mm_2251 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3379887Z triton_mm_2252 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3380355Z triton_mm_2249 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3380724Z SingleProcess AUTOTUNE benchmarking takes 0.1788 seconds and 1.8303 seconds precompiling for 13 choices 2025-12-04T09:41:44.3380815Z Autotune Choices Stats: 2025-12-04T09:41:44.3381643Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02751999907195568, "best_triton_pos": 0} 2025-12-04T09:41:44.3381738Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3381822Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3381935Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3382407Z triton_mm_2272 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3382882Z 
triton_mm_2268 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3383353Z triton_mm_2269 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3383869Z triton_mm_2270 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3384335Z triton_mm_2271 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3384801Z triton_mm_2273 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3385310Z triton_mm_2274 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3385782Z triton_mm_2275 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3386260Z triton_mm_2276 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3386727Z triton_mm_2277 0.0276 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3387060Z SingleProcess AUTOTUNE benchmarking takes 0.2011 seconds and 0.6055 seconds precompiling for 15 choices 
2025-12-04T09:41:44.3387233Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3387328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3387462Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3387740Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3389179Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3389265Z graph_break [] 2025-12-04T09:41:44.3389367Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3389543Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3389634Z Autotune Choices Stats: 2025-12-04T09:41:44.3390502Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3390611Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3390694Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3390797Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3391277Z triton_mm_2300 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3391764Z triton_mm_2312 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3392243Z triton_mm_2309 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3392711Z triton_mm_2299 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3393224Z triton_mm_2298 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3393691Z triton_mm_2301 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3394156Z triton_mm_2302 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3394669Z triton_mm_2303 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3395142Z triton_mm_2305 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3395666Z triton_mm_2304 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3396000Z SingleProcess AUTOTUNE benchmarking takes 0.2055 seconds and 0.6075 seconds precompiling for 15 choices 2025-12-04T09:41:44.3396092Z Autotune Choices Stats: 2025-12-04T09:41:44.3396936Z {"num_choices": 13, 
"num_triton_choices": 13, "best_kernel": "triton_mm_2329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3397029Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3397116Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3397219Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3397748Z triton_mm_2329 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3398271Z triton_mm_2330 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3398736Z triton_mm_2331 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3399250Z triton_mm_2328 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3399760Z triton_mm_2334 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3400230Z triton_mm_2333 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3401051Z triton_mm_2332 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3401535Z triton_mm_2338 0.0338 ms 81.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3402013Z triton_mm_2335 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3402483Z triton_mm_2337 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3402814Z SingleProcess AUTOTUNE benchmarking takes 0.1778 seconds and 1.8164 seconds precompiling for 13 choices 2025-12-04T09:41:44.3403073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3403168Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3403304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3403552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3404993Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3405079Z graph_break [] 2025-12-04T09:41:44.3405180Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3405355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3405448Z Autotune Choices Stats: 2025-12-04T09:41:44.3406275Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3406371Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3406455Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3406565Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3407040Z triton_mm_2343 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3407604Z triton_mm_2349 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3408086Z triton_mm_2355 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3408558Z triton_mm_2347 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3409091Z triton_mm_2351 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3409560Z triton_mm_2341 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3410037Z triton_mm_2342 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3410502Z triton_mm_2344 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3410976Z triton_mm_2352 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3411455Z triton_mm_2353 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3411781Z SingleProcess AUTOTUNE benchmarking takes 0.2044 seconds and 0.6205 seconds precompiling for 15 choices 2025-12-04T09:41:44.3411876Z Autotune Choices Stats: 2025-12-04T09:41:44.3412756Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2372", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3412852Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3412938Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3413038Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3413562Z triton_mm_2372 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3414030Z triton_mm_2371 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3414501Z triton_mm_2373 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3414968Z triton_mm_2374 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.3415436Z triton_mm_2377 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3415904Z triton_mm_2376 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3416372Z triton_mm_2375 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3416887Z triton_mm_2378 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3417406Z triton_mm_2381 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3417882Z triton_mm_2379 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3418213Z SingleProcess AUTOTUNE benchmarking takes 0.1763 seconds and 1.8240 seconds precompiling for 13 choices 2025-12-04T09:41:44.3418428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3418525Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3418654Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3418903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3420274Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 
13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3420360Z graph_break [] 2025-12-04T09:41:44.3420461Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3420635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3420729Z Autotune Choices Stats: 2025-12-04T09:41:44.3421556Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2388", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.02675200067460537, "best_triton_pos": 0} 2025-12-04T09:41:44.3421650Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3421781Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3421890Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3422368Z triton_mm_2388 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3422873Z triton_mm_2384 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3423337Z triton_mm_2387 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3423812Z triton_mm_2391 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3424283Z triton_mm_2392 0.0276 ms 96.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3424755Z triton_mm_2393 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3425235Z triton_mm_2396 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3425711Z triton_mm_2397 0.0276 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3426219Z triton_mm_2394 0.0286 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3426692Z triton_mm_2385 0.0287 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3427021Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6219 seconds precompiling for 15 choices 2025-12-04T09:41:44.3427114Z Autotune Choices Stats: 2025-12-04T09:41:44.3428039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2415", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3428148Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3428231Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3428336Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3428818Z triton_mm_2415 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3429293Z triton_mm_2416 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3429759Z triton_mm_2414 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3430229Z triton_mm_2417 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3430696Z triton_mm_2420 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3431199Z triton_mm_2419 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3431670Z triton_mm_2418 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3432182Z triton_mm_2423 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3432660Z triton_mm_2424 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3433131Z triton_mm_2421 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3433462Z SingleProcess AUTOTUNE benchmarking takes 0.1780 seconds and 1.8157 seconds precompiling for 13 choices 2025-12-04T09:41:44.3433638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3433731Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3433866Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3434113Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3435056Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3435181Z graph_break [] 2025-12-04T09:41:44.3435283Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3435459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3435552Z Autotune Choices Stats: 2025-12-04T09:41:44.3436431Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2427", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3436531Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3436615Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3436762Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3437245Z triton_mm_2427 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3437715Z triton_mm_2429 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3438188Z triton_mm_2432 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3438655Z triton_mm_2436 0.0277 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3439132Z triton_mm_2438 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3439649Z triton_mm_2437 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3440119Z triton_mm_2428 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3440625Z triton_mm_2430 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3441087Z triton_mm_2431 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3441620Z triton_mm_2433 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3441944Z SingleProcess AUTOTUNE benchmarking takes 0.2016 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.3442040Z Autotune Choices Stats: 2025-12-04T09:41:44.3442864Z {"num_choices": 13, "num_triton_choices": 13, 
"best_kernel": "triton_mm_2459", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3442955Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3443042Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3443143Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3443619Z triton_mm_2459 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3444089Z triton_mm_2460 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3444595Z triton_mm_2457 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3445073Z triton_mm_2458 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3445534Z triton_mm_2463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3446043Z triton_mm_2462 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3446507Z triton_mm_2461 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3446985Z triton_mm_2466 0.0338 ms 81.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3447505Z triton_mm_2464 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3447973Z triton_mm_2465 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3448307Z SingleProcess AUTOTUNE benchmarking takes 0.1760 seconds and 1.8101 seconds precompiling for 13 choices 2025-12-04T09:41:44.3448401Z Autotune Choices Stats: 2025-12-04T09:41:44.3449229Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3449322Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3449406Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3449563Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3450031Z triton_mm_2488 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3450500Z triton_mm_2485 0.0267 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3451004Z triton_mm_2483 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3451473Z triton_mm_2484 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3451945Z triton_mm_2486 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3452406Z triton_mm_2489 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3452875Z triton_mm_2490 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3453345Z triton_mm_2492 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3453856Z triton_mm_2493 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3454325Z triton_mm_2494 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3454653Z SingleProcess AUTOTUNE benchmarking takes 0.5384 seconds and 0.6144 seconds precompiling for 15 choices 2025-12-04T09:41:44.3454830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3454927Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3455061Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3455345Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3456283Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), 
('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3456370Z graph_break [] 2025-12-04T09:41:44.3456477Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3456647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3456741Z Autotune Choices Stats: 2025-12-04T09:41:44.3457630Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2524", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.3457731Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3457816Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3457918Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3458406Z triton_mm_2524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3458915Z triton_mm_2515 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3459383Z triton_mm_2519 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3459894Z triton_mm_2522 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3460374Z triton_mm_2526 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3460844Z triton_mm_2523 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3461309Z triton_mm_2518 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3461776Z triton_mm_2516 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3462241Z triton_mm_2513 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3462717Z triton_mm_2514 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3463090Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6466 seconds precompiling for 15 choices 2025-12-04T09:41:44.3463180Z Autotune Choices Stats: 2025-12-04T09:41:44.3464006Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2543", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3464097Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3464187Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3464289Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3464927Z triton_mm_2543 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=1, num_warps=2 2025-12-04T09:41:44.3465408Z triton_mm_2544 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3465919Z triton_mm_2545 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3466399Z triton_mm_2546 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3466867Z triton_mm_2549 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3467338Z triton_mm_2548 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3467804Z triton_mm_2547 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3468318Z triton_mm_2553 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3468792Z triton_mm_2550 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3469301Z triton_mm_2551 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3469634Z SingleProcess AUTOTUNE benchmarking takes 0.1809 
seconds and 1.8557 seconds precompiling for 13 choices 2025-12-04T09:41:44.3469723Z Autotune Choices Stats: 2025-12-04T09:41:44.3470606Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2571", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3470699Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3470787Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3470901Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3471372Z triton_mm_2571 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3471849Z triton_mm_2572 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3472321Z triton_mm_2575 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3472844Z triton_mm_2577 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3473332Z triton_mm_2582 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3473803Z triton_mm_2579 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3474320Z triton_mm_2580 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3474796Z triton_mm_2583 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3475274Z triton_mm_2570 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3475738Z triton_mm_2574 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3476065Z SingleProcess AUTOTUNE benchmarking takes 0.2034 seconds and 0.6163 seconds precompiling for 15 choices 2025-12-04T09:41:44.3476246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3476340Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3476469Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3476717Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3478224Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3478310Z graph_break [] 2025-12-04T09:41:44.3478412Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3478627Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3478720Z Autotune Choices Stats: 
2025-12-04T09:41:44.3479599Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.3479701Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3479786Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3479893Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3480369Z triton_mm_2605 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3480837Z triton_mm_2602 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3481320Z triton_mm_2606 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3481792Z triton_mm_2609 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3482317Z triton_mm_2610 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3482793Z triton_mm_2613 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3483258Z triton_mm_2604 0.0277 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.3483768Z triton_mm_2601 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3484234Z triton_mm_2599 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3484714Z triton_mm_2600 0.0287 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3485040Z SingleProcess AUTOTUNE benchmarking takes 0.2015 seconds and 0.6405 seconds precompiling for 15 choices 2025-12-04T09:41:44.3485132Z Autotune Choices Stats: 2025-12-04T09:41:44.3485965Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3486060Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3486150Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3486255Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3486727Z triton_mm_2635 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3487235Z triton_mm_2632 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3487758Z triton_mm_2630 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3488266Z 
triton_mm_2631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3488731Z triton_mm_2629 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3489202Z triton_mm_2634 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3489665Z triton_mm_2633 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3490135Z triton_mm_2638 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3490614Z triton_mm_2639 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3491081Z triton_mm_2636 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3491455Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8716 seconds precompiling for 13 choices 2025-12-04T09:41:44.3491629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3491723Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3491854Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3492097Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3493086Z inductor 
[('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3493171Z graph_break [] 2025-12-04T09:41:44.3493275Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3493451Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3493543Z Autotune Choices Stats: 2025-12-04T09:41:44.3494395Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3494486Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3494579Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3494686Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3495162Z triton_mm_2647 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3495649Z triton_mm_2654 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=8 2025-12-04T09:41:44.3496172Z triton_mm_2656 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3496642Z triton_mm_2643 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3497109Z triton_mm_2642 0.0287 ms 96.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3497615Z triton_mm_2644 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3498086Z triton_mm_2645 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3498556Z triton_mm_2646 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3499024Z triton_mm_2648 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3499495Z triton_mm_2649 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3499831Z SingleProcess AUTOTUNE benchmarking takes 0.2073 seconds and 0.6119 seconds precompiling for 15 choices 2025-12-04T09:41:44.3499926Z Autotune Choices Stats: 2025-12-04T09:41:44.3501039Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027712000533938408, "best_triton_pos": 0} 2025-12-04T09:41:44.3501135Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3501219Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3501322Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3501796Z triton_mm_2674 0.0277 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3502334Z triton_mm_2675 0.0278 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3502801Z triton_mm_2672 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3503275Z triton_mm_2673 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3503739Z triton_mm_2678 0.0287 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3504203Z triton_mm_2677 0.0308 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3504670Z triton_mm_2676 0.0328 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3505143Z triton_mm_2682 0.0338 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3505673Z triton_mm_2679 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3506145Z triton_mm_2680 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.3506472Z SingleProcess AUTOTUNE benchmarking takes 0.1777 seconds and 1.7626 seconds precompiling for 13 choices 2025-12-04T09:41:44.3506616Z Autotune Choices Stats: 2025-12-04T09:41:44.3507465Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3507558Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3507656Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3507782Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3508289Z triton_mm_2702 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3508761Z triton_mm_2699 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3509232Z triton_mm_2700 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3509700Z triton_mm_2703 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3510225Z triton_mm_2706 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3510703Z triton_mm_2711 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3511166Z triton_mm_2704 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3511702Z triton_mm_2709 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3512170Z triton_mm_2698 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3512636Z triton_mm_2701 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3512966Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6140 seconds precompiling for 15 choices 2025-12-04T09:41:44.3513139Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3513231Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3513365Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3513608Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3514976Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3515102Z graph_break [] 2025-12-04T09:41:44.3515206Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3515379Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:41:44.3515469Z Autotune Choices Stats: 2025-12-04T09:41:44.3516331Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3516464Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3516548Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3516658Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3517138Z triton_mm_2742 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3517616Z triton_mm_2730 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3518129Z triton_mm_2731 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3518601Z triton_mm_2736 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3519074Z triton_mm_2738 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3519635Z triton_mm_2741 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3520112Z triton_mm_2729 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3520582Z triton_mm_2739 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3521098Z triton_mm_2733 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3521565Z triton_mm_2728 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3521894Z SingleProcess AUTOTUNE benchmarking takes 0.2002 seconds and 0.6223 seconds precompiling for 15 choices 2025-12-04T09:41:44.3521992Z Autotune Choices Stats: 2025-12-04T09:41:44.3522830Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2759", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3522928Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3523011Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3523112Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3523594Z triton_mm_2759 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3524068Z triton_mm_2760 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3524582Z triton_mm_2761 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3525049Z triton_mm_2758 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3525561Z triton_mm_2764 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3526031Z triton_mm_2763 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3526497Z triton_mm_2762 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3526970Z triton_mm_2768 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3527436Z triton_mm_2765 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3527921Z triton_mm_2767 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3528250Z SingleProcess AUTOTUNE benchmarking takes 0.1772 seconds and 1.8114 seconds precompiling for 13 choices 2025-12-04T09:41:44.3528463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3528561Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3528689Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3528940Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3529885Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3529968Z graph_break [] 2025-12-04T09:41:44.3530076Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3530287Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3530381Z Autotune Choices Stats: 2025-12-04T09:41:44.3531223Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2778", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3535772Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3535878Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3535990Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3536500Z triton_mm_2778 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3536991Z triton_mm_2782 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3537513Z triton_mm_2775 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3537987Z triton_mm_2776 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3538522Z triton_mm_2773 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3539000Z triton_mm_2779 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3539511Z triton_mm_2771 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3539986Z triton_mm_2772 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3540459Z triton_mm_2774 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3540927Z triton_mm_2777 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3541264Z SingleProcess AUTOTUNE benchmarking takes 0.2069 seconds and 0.6137 seconds precompiling for 15 choices 2025-12-04T09:41:44.3541361Z Autotune Choices Stats: 2025-12-04T09:41:44.3542206Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2803", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3542345Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3542437Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3542548Z dtypes: torch.float32, 
torch.float32 2025-12-04T09:41:44.3543024Z triton_mm_2803 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3543504Z triton_mm_2804 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3544021Z triton_mm_2802 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3544495Z triton_mm_2806 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3544965Z triton_mm_2807 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3545435Z triton_mm_2801 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3545913Z triton_mm_2805 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3546392Z triton_mm_2808 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3546873Z triton_mm_2810 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3547359Z triton_mm_2811 0.0348 ms 82.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3547777Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8745 seconds precompiling for 13 choices 2025-12-04T09:41:44.3547870Z Autotune Choices Stats: 2025-12-04T09:41:44.3548718Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2830", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.3548859Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3548946Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3549059Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3549543Z triton_mm_2830 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3550018Z triton_mm_2829 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3550487Z triton_mm_2832 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3550962Z triton_mm_2835 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3551444Z triton_mm_2837 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3552001Z triton_mm_2841 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3552484Z triton_mm_2838 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3552960Z triton_mm_2834 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3553468Z triton_mm_2833 0.0286 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3553942Z triton_mm_2827 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3554274Z SingleProcess AUTOTUNE benchmarking takes 0.2004 seconds and 0.6102 seconds precompiling for 15 choices 2025-12-04T09:41:44.3554455Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3554555Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3554688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3554942Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3556331Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3556425Z graph_break [] 2025-12-04T09:41:44.3556531Z 
aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3556707Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3556804Z Autotune Choices Stats: 2025-12-04T09:41:44.3557688Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2858", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3557824Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3557909Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3558017Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3558500Z triton_mm_2858 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3558973Z triton_mm_2859 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3559448Z triton_mm_2861 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3559988Z triton_mm_2870 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3560466Z triton_mm_2866 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3560941Z triton_mm_2867 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3561463Z triton_mm_2865 0.0286 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3561935Z triton_mm_2857 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3562401Z triton_mm_2860 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3562914Z triton_mm_2862 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3563252Z SingleProcess AUTOTUNE benchmarking takes 0.2020 seconds and 0.6130 seconds precompiling for 15 choices 2025-12-04T09:41:44.3563348Z Autotune Choices Stats: 2025-12-04T09:41:44.3564210Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2888", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3564307Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3564395Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3564504Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3564987Z triton_mm_2888 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3565464Z triton_mm_2889 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3565931Z 
triton_mm_2890 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3566441Z triton_mm_2893 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3566907Z triton_mm_2887 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3567422Z triton_mm_2892 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3567934Z triton_mm_2891 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3568407Z triton_mm_2894 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3568885Z triton_mm_2897 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3569355Z triton_mm_2895 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3569692Z SingleProcess AUTOTUNE benchmarking takes 0.1798 seconds and 1.8281 seconds precompiling for 13 choices 2025-12-04T09:41:44.3569868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3569964Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3570099Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3570384Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3571326Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3571415Z graph_break [] 2025-12-04T09:41:44.3571520Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3571696Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3571791Z Autotune Choices Stats: 2025-12-04T09:41:44.3572669Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2910", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3572771Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3572860Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3572969Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3573458Z triton_mm_2910 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3573934Z triton_mm_2902 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3574413Z triton_mm_2908 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3574883Z triton_mm_2909 0.0276 ms 96.3% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3575362Z triton_mm_2911 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3575879Z triton_mm_2914 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3576354Z triton_mm_2900 0.0285 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3576868Z triton_mm_2907 0.0286 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3577341Z triton_mm_2901 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3577816Z triton_mm_2903 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3578202Z SingleProcess AUTOTUNE benchmarking takes 0.2107 seconds and 0.6175 seconds precompiling for 15 choices 2025-12-04T09:41:44.3578296Z Autotune Choices Stats: 2025-12-04T09:41:44.3579127Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_2933", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3579226Z AUTOTUNE mm(256x256, 256x256) 
2025-12-04T09:41:44.3579317Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3579422Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3579904Z triton_mm_2933 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3580427Z triton_mm_2936 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3580899Z triton_mm_2931 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3581370Z triton_mm_2932 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3581876Z triton_mm_2930 0.0296 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3582348Z triton_mm_2935 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3582816Z triton_mm_2934 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3583299Z triton_mm_2940 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3583786Z triton_mm_2939 0.0346 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, 
num_warps=4 2025-12-04T09:41:44.3584266Z triton_mm_2937 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3584605Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8411 seconds precompiling for 13 choices 2025-12-04T09:41:44.3584698Z Autotune Choices Stats: 2025-12-04T09:41:44.3585575Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.026784000918269157, "best_triton_pos": 0} 2025-12-04T09:41:44.3585672Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3585758Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3585941Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3586421Z triton_mm_2964 0.0268 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3586892Z triton_mm_2958 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3587362Z triton_mm_2961 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3587835Z triton_mm_2963 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3588308Z triton_mm_2965 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3588786Z triton_mm_2966 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3589264Z triton_mm_2967 0.0276 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3589779Z triton_mm_2957 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3590247Z triton_mm_2959 0.0277 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3590713Z triton_mm_2960 0.0278 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3591081Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.7292 seconds precompiling for 15 choices 2025-12-04T09:41:44.3591262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3591361Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3591498Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3591744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3592688Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3592778Z graph_break [] 2025-12-04T09:41:44.3592888Z aten_mm_info [('aten.mm_256_256_256', 1)] 
2025-12-04T09:41:44.3593065Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3593158Z Autotune Choices Stats: 2025-12-04T09:41:44.3593999Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_2996", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3594100Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3594232Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3594341Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3594831Z triton_mm_2996 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3595309Z triton_mm_2997 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3595827Z triton_mm_2987 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3596295Z triton_mm_2990 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3596769Z triton_mm_2989 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3597258Z triton_mm_2986 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3597754Z triton_mm_2988 0.0287 ms 
96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3598221Z triton_mm_2991 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3598724Z triton_mm_2992 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3599202Z triton_mm_2993 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3599583Z SingleProcess AUTOTUNE benchmarking takes 0.2017 seconds and 0.6185 seconds precompiling for 15 choices 2025-12-04T09:41:44.3599676Z Autotune Choices Stats: 2025-12-04T09:41:44.3600804Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3600901Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3600993Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3601097Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3601569Z triton_mm_3016 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3602046Z triton_mm_3017 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3602514Z triton_mm_3018 0.0287 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3602994Z triton_mm_3019 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3603453Z triton_mm_3022 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3603979Z triton_mm_3021 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3604442Z triton_mm_3020 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3604909Z triton_mm_3026 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3605438Z triton_mm_3023 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3605907Z triton_mm_3024 0.0358 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3606241Z SingleProcess AUTOTUNE benchmarking takes 0.1834 seconds and 1.8278 seconds precompiling for 13 choices 2025-12-04T09:41:44.3606331Z Autotune Choices Stats: 2025-12-04T09:41:44.3607178Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3055", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.02672000043094158, "best_triton_pos": 0} 2025-12-04T09:41:44.3607272Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3607358Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3607471Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3608004Z triton_mm_3055 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3608529Z triton_mm_3042 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3608989Z triton_mm_3045 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3609449Z triton_mm_3046 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3609956Z triton_mm_3047 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3610419Z triton_mm_3048 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3610895Z triton_mm_3050 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3611361Z triton_mm_3051 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3611832Z triton_mm_3052 0.0276 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3612302Z triton_mm_3049 0.0278 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3612629Z SingleProcess AUTOTUNE benchmarking takes 0.2019 seconds and 0.6027 seconds precompiling for 15 choices 2025-12-04T09:41:44.3612806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3612898Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3613071Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3613320Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3614724Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3614851Z graph_break [] 2025-12-04T09:41:44.3614952Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3615130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3615219Z Autotune Choices Stats: 2025-12-04T09:41:44.3616044Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3072", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027648000046610832, 
"best_triton_pos": 0} 2025-12-04T09:41:44.3616139Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3616223Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3616326Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3616805Z triton_mm_3072 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3617275Z triton_mm_3077 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3617799Z triton_mm_3081 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3618273Z triton_mm_3082 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3618737Z triton_mm_3078 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3619245Z triton_mm_3073 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3619708Z triton_mm_3074 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3620175Z triton_mm_3075 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3620636Z triton_mm_3076 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3621107Z triton_mm_3079 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3621441Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6365 seconds precompiling for 15 choices 2025-12-04T09:41:44.3621541Z Autotune Choices Stats: 2025-12-04T09:41:44.3622373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3103", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.030719999223947525, "best_triton_pos": 0} 2025-12-04T09:41:44.3622531Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3622622Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3622722Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3623193Z triton_mm_3103 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3623710Z triton_mm_3104 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3624177Z triton_mm_3105 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3624644Z triton_mm_3102 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3625108Z triton_mm_3108 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3625572Z triton_mm_3107 0.0327 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3626035Z triton_mm_3106 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3626508Z triton_mm_3109 0.0348 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3627016Z triton_mm_3110 0.0358 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3627489Z triton_mm_3111 0.0358 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3627868Z SingleProcess AUTOTUNE benchmarking takes 0.1866 seconds and 1.8188 seconds precompiling for 13 choices 2025-12-04T09:41:44.3628045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3628182Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3628312Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3628554Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3629501Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3629582Z graph_break [] 
2025-12-04T09:41:44.3629684Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3629861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3629949Z Autotune Choices Stats: 2025-12-04T09:41:44.3630788Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3117", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3630880Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3630967Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3631074Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3631589Z triton_mm_3117 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3632069Z triton_mm_3122 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3632540Z triton_mm_3123 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3633053Z triton_mm_3124 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3633536Z triton_mm_3129 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3634010Z triton_mm_3125 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=3, num_warps=4 2025-12-04T09:41:44.3634485Z triton_mm_3115 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3634945Z triton_mm_3118 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3635414Z triton_mm_3119 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3635878Z triton_mm_3120 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3636250Z SingleProcess AUTOTUNE benchmarking takes 0.2049 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.3636344Z Autotune Choices Stats: 2025-12-04T09:41:44.3637178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3147", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3637274Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3637358Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3637499Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3638005Z triton_mm_3147 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3638498Z triton_mm_3148 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.3638972Z triton_mm_3151 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3639439Z triton_mm_3146 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3639959Z triton_mm_3150 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3640421Z triton_mm_3145 0.0317 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3640928Z triton_mm_3149 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3641403Z triton_mm_3152 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3641870Z triton_mm_3154 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3642385Z triton_mm_3155 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3642709Z SingleProcess AUTOTUNE benchmarking takes 0.1786 seconds and 1.8247 seconds precompiling for 13 choices 2025-12-04T09:41:44.3642802Z Autotune Choices Stats: 2025-12-04T09:41:44.3643626Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3171", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.027583999559283257, "best_triton_pos": 0} 2025-12-04T09:41:44.3643716Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3643804Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3643913Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3644384Z triton_mm_3171 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3644855Z triton_mm_3172 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3645374Z triton_mm_3173 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3645844Z triton_mm_3174 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3646305Z triton_mm_3175 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3646814Z triton_mm_3176 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3647297Z triton_mm_3177 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3647797Z triton_mm_3178 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3648270Z triton_mm_3179 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3648736Z triton_mm_3180 0.0276 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3649067Z SingleProcess AUTOTUNE benchmarking takes 0.2000 seconds and 0.6082 seconds precompiling for 15 choices 2025-12-04T09:41:44.3649236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3649328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3649465Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3649704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3651110Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3651231Z graph_break [] 2025-12-04T09:41:44.3651333Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3651507Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3651597Z Autotune Choices Stats: 2025-12-04T09:41:44.3652449Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3208", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3652545Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3652629Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3652739Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3653215Z triton_mm_3208 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3653692Z triton_mm_3210 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3654165Z triton_mm_3211 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3654683Z triton_mm_3212 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3655150Z triton_mm_3201 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3655616Z triton_mm_3209 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3656152Z triton_mm_3202 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3656618Z triton_mm_3206 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3657085Z 
triton_mm_3207 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3657585Z triton_mm_3214 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3657938Z SingleProcess AUTOTUNE benchmarking takes 0.5512 seconds and 0.5884 seconds precompiling for 15 choices 2025-12-04T09:41:44.3658034Z Autotune Choices Stats: 2025-12-04T09:41:44.3658883Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3232", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3658981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3659065Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3659169Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3659694Z triton_mm_3232 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3660167Z triton_mm_3233 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3660678Z triton_mm_3234 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3661139Z triton_mm_3231 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3661611Z triton_mm_3237 0.0297 ms 96.6% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3662080Z triton_mm_3236 0.0317 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3662541Z triton_mm_3235 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3663014Z triton_mm_3238 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3663480Z triton_mm_3240 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3663993Z triton_mm_3241 0.0347 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3664321Z SingleProcess AUTOTUNE benchmarking takes 0.1808 seconds and 1.8392 seconds precompiling for 13 choices 2025-12-04T09:41:44.3664491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3664590Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3664718Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3664969Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3666376Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 
5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3666465Z graph_break [] 2025-12-04T09:41:44.3666569Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3666739Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3666831Z Autotune Choices Stats: 2025-12-04T09:41:44.3667681Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3249", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3667780Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3667889Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3667993Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3668474Z triton_mm_3249 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3668997Z triton_mm_3251 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3669479Z triton_mm_3252 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3669999Z triton_mm_3253 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3670470Z triton_mm_3254 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3670949Z triton_mm_3255 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3671416Z triton_mm_3246 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3671881Z triton_mm_3244 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3672353Z triton_mm_3245 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3672821Z triton_mm_3247 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3673193Z SingleProcess AUTOTUNE benchmarking takes 0.2118 seconds and 0.6062 seconds precompiling for 15 choices 2025-12-04T09:41:44.3673283Z Autotune Choices Stats: 2025-12-04T09:41:44.3674113Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3275", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3674207Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3674294Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3674399Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3674913Z triton_mm_3275 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=2, num_warps=4 2025-12-04T09:41:44.3675382Z triton_mm_3276 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3675855Z triton_mm_3277 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3676317Z triton_mm_3280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3676785Z triton_mm_3274 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3677277Z triton_mm_3279 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3677785Z triton_mm_3278 0.0328 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3678294Z triton_mm_3283 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3678762Z triton_mm_3284 0.0348 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3679274Z triton_mm_3281 0.0348 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3679654Z SingleProcess AUTOTUNE benchmarking takes 0.1792 seconds 
and 1.8272 seconds precompiling for 13 choices 2025-12-04T09:41:44.3679832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3679927Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3680057Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3680307Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3681670Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3681761Z graph_break [] 2025-12-04T09:41:44.3681861Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3682031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3682171Z Autotune Choices Stats: 2025-12-04T09:41:44.3683017Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3300", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.3683111Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3683194Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3683298Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3683785Z triton_mm_3300 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3684299Z triton_mm_3288 0.0276 ms 96.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3684770Z triton_mm_3289 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3685247Z triton_mm_3294 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3685718Z triton_mm_3295 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3686191Z triton_mm_3296 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3686655Z triton_mm_3287 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3687123Z triton_mm_3290 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3687673Z triton_mm_3291 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3688141Z triton_mm_3292 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3688507Z SingleProcess AUTOTUNE benchmarking takes 0.2039 seconds and 0.6016 seconds precompiling for 15 choices 2025-12-04T09:41:44.3688598Z Autotune Choices Stats: 
2025-12-04T09:41:44.3689459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.028575999662280083, "best_triton_pos": 0} 2025-12-04T09:41:44.3689552Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3689639Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3689742Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3690218Z triton_mm_3318 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3690694Z triton_mm_3319 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3691164Z triton_mm_3320 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3691630Z triton_mm_3323 0.0287 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3692162Z triton_mm_3317 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3692627Z triton_mm_3322 0.0307 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3693090Z triton_mm_3321 0.0337 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 
2025-12-04T09:41:44.3693601Z triton_mm_3324 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3694077Z triton_mm_3326 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3694553Z triton_mm_3327 0.0348 ms 82.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3694881Z SingleProcess AUTOTUNE benchmarking takes 0.1794 seconds and 1.8359 seconds precompiling for 13 choices 2025-12-04T09:41:44.3695054Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3695152Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3695288Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3695535Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3696952Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3697037Z graph_break [] 2025-12-04T09:41:44.3697139Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3697315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3697409Z Autotune Choices Stats: 2025-12-04T09:41:44.3698348Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3344", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026655999943614006, "best_triton_pos": 0} 2025-12-04T09:41:44.3698440Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3698527Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3698634Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3699116Z triton_mm_3344 0.0267 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3699585Z triton_mm_3332 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3700047Z triton_mm_3333 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3700702Z triton_mm_3339 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3701178Z triton_mm_3341 0.0276 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3701722Z triton_mm_3331 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3702198Z triton_mm_3338 0.0285 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3702661Z triton_mm_3330 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3703189Z triton_mm_3334 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3703658Z triton_mm_3335 0.0287 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3703991Z SingleProcess AUTOTUNE benchmarking takes 0.2046 seconds and 0.6449 seconds precompiling for 15 choices 2025-12-04T09:41:44.3704084Z Autotune Choices Stats: 2025-12-04T09:41:44.3704910Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3366", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.3705007Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3705094Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3705195Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3705673Z triton_mm_3366 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3706139Z triton_mm_3360 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3707009Z triton_mm_3361 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3707487Z triton_mm_3362 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3708018Z triton_mm_3363 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3708488Z triton_mm_3365 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3708965Z triton_mm_3367 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3709434Z triton_mm_3364 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3709904Z triton_mm_3369 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3710379Z triton_mm_3370 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3710708Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.9040 seconds precompiling for 13 choices 2025-12-04T09:41:44.3710923Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3711022Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3711155Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3711403Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3712343Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3712430Z graph_break [] 2025-12-04T09:41:44.3712577Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3712749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3712844Z Autotune Choices Stats: 2025-12-04T09:41:44.3713680Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3382", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.3713772Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3713857Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3713959Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3714435Z triton_mm_3382 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3714915Z triton_mm_3374 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3715380Z triton_mm_3379 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3715895Z triton_mm_3380 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3716367Z triton_mm_3381 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, 
num_warps=8 2025-12-04T09:41:44.3716844Z triton_mm_3384 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3717352Z triton_mm_3373 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3717868Z triton_mm_3375 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3718337Z triton_mm_3376 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3718804Z triton_mm_3377 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3719142Z SingleProcess AUTOTUNE benchmarking takes 0.2041 seconds and 0.6036 seconds precompiling for 15 choices 2025-12-04T09:41:44.3719233Z Autotune Choices Stats: 2025-12-04T09:41:44.3720120Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3406", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.3720258Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3720343Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3720451Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3720930Z triton_mm_3406 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 
2025-12-04T09:41:44.3721400Z triton_mm_3404 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3721906Z triton_mm_3405 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3722372Z triton_mm_3409 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3722841Z triton_mm_3403 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3723305Z triton_mm_3408 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3723769Z triton_mm_3407 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3724243Z triton_mm_3410 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3724714Z triton_mm_3412 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3725227Z triton_mm_3413 0.0338 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3725554Z SingleProcess AUTOTUNE benchmarking takes 0.1782 seconds and 1.8691 seconds 
precompiling for 13 choices 2025-12-04T09:41:44.3725647Z Autotune Choices Stats: 2025-12-04T09:41:44.3726480Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3437", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.027615999802947044, "best_triton_pos": 0} 2025-12-04T09:41:44.3726642Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3726728Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3726840Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3727314Z triton_mm_3437 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3727830Z triton_mm_3438 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3728302Z triton_mm_3440 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3728778Z triton_mm_3442 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3729254Z triton_mm_3443 0.0276 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3729770Z triton_mm_3430 0.0277 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3730235Z triton_mm_3435 0.0278 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3730701Z triton_mm_3429 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3731199Z triton_mm_3433 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3731673Z triton_mm_3436 0.0287 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3732005Z SingleProcess AUTOTUNE benchmarking takes 0.2029 seconds and 0.6051 seconds precompiling for 15 choices 2025-12-04T09:41:44.3732178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3732274Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3732401Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3732642Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3733585Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3733667Z graph_break [] 2025-12-04T09:41:44.3733773Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3733945Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3734034Z Autotune Choices Stats: 2025-12-04T09:41:44.3734906Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3462", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3734999Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3735086Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3735226Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3735702Z triton_mm_3462 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3736178Z triton_mm_3466 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3736654Z triton_mm_3467 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3737139Z triton_mm_3472 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3737600Z triton_mm_3459 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3738080Z triton_mm_3460 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3738543Z triton_mm_3461 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3739048Z triton_mm_3463 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3739515Z triton_mm_3465 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3739982Z triton_mm_3468 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3740350Z SingleProcess AUTOTUNE benchmarking takes 0.2018 seconds and 0.5791 seconds precompiling for 15 choices 2025-12-04T09:41:44.3740442Z Autotune Choices Stats: 2025-12-04T09:41:44.3741278Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3489", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2", "best_time": 0.028672000393271446, "best_triton_pos": 0} 2025-12-04T09:41:44.3741379Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3741462Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3741566Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3742039Z triton_mm_3489 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3742516Z triton_mm_3490 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3742989Z triton_mm_3491 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3743457Z triton_mm_3492 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
num_stages=5, num_warps=8 2025-12-04T09:41:44.3743968Z triton_mm_3495 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3744431Z triton_mm_3494 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3744945Z triton_mm_3493 0.0328 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3745415Z triton_mm_3498 0.0338 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3745887Z triton_mm_3496 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3746356Z triton_mm_3499 0.0348 ms 82.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3746680Z SingleProcess AUTOTUNE benchmarking takes 0.1783 seconds and 1.8202 seconds precompiling for 13 choices 2025-12-04T09:41:44.3746775Z Autotune Choices Stats: 2025-12-04T09:41:44.3747606Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3516", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3747745Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3747848Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3747970Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:41:44.3748466Z triton_mm_3516 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3748933Z triton_mm_3517 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3749446Z triton_mm_3520 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3749923Z triton_mm_3522 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3750399Z triton_mm_3524 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3750879Z triton_mm_3525 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3751340Z triton_mm_3515 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3751810Z triton_mm_3518 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3752275Z triton_mm_3519 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3752744Z triton_mm_3521 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3753117Z SingleProcess AUTOTUNE benchmarking takes 0.2014 seconds and 0.6213 seconds precompiling for 15 choices 2025-12-04T09:41:44.3753290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3753389Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3753518Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3753802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3754746Z inductor [('triton_bundler_save_kernel', 40), ('generated_module_cache_hit', 14), ('benchmarking.InductorBenchmarker.benchmark', 4), ('benchmarking.InductorBenchmarker.benchmark_gpu', 4), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3754831Z graph_break [] 2025-12-04T09:41:44.3754937Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3755109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3755197Z Autotune Choices Stats: 2025-12-04T09:41:44.3756016Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3547", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3756110Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3756199Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3756305Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3756782Z triton_mm_3547 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3757308Z triton_mm_3552 0.0276 
ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3757778Z triton_mm_3556 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3758296Z triton_mm_3548 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3758807Z triton_mm_3546 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3759277Z triton_mm_3555 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3759804Z triton_mm_3549 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3760268Z triton_mm_3550 0.0297 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3760737Z triton_mm_3545 0.0306 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3761206Z triton_mm_3551 0.0307 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3761538Z SingleProcess AUTOTUNE benchmarking takes 0.2059 seconds and 0.6126 seconds precompiling for 15 choices 2025-12-04T09:41:44.3761630Z 
Autotune Choices Stats: 2025-12-04T09:41:44.3762520Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3577", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.02860799990594387, "best_triton_pos": 0} 2025-12-04T09:41:44.3762617Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3762703Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3762808Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3763324Z triton_mm_3577 0.0286 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3763796Z triton_mm_3576 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3764265Z triton_mm_3578 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3764732Z triton_mm_3581 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3765207Z triton_mm_3575 0.0297 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3765675Z triton_mm_3580 0.0307 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3766143Z triton_mm_3579 0.0328 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, 
num_warps=4 2025-12-04T09:41:44.3766656Z triton_mm_3584 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3767132Z triton_mm_3585 0.0338 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3767604Z triton_mm_3582 0.0348 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3767970Z SingleProcess AUTOTUNE benchmarking takes 0.1807 seconds and 1.8175 seconds precompiling for 13 choices 2025-12-04T09:41:44.3768064Z Autotune Choices Stats: 2025-12-04T09:41:44.3768907Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3603", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3769001Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3769088Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3769198Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:41:44.3769675Z triton_mm_3603 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3770148Z triton_mm_3607 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3770624Z triton_mm_3612 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.3771102Z triton_mm_3611 0.0277 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3771606Z triton_mm_3606 0.0278 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3772082Z triton_mm_3605 0.0285 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3772596Z triton_mm_3608 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3773063Z triton_mm_3601 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3773534Z triton_mm_3602 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3773998Z triton_mm_3604 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3774331Z SingleProcess AUTOTUNE benchmarking takes 0.2030 seconds and 0.6335 seconds precompiling for 15 choices 2025-12-04T09:41:44.3774506Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3774605Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3774737Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3774982Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:41:44.3776363Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3776487Z graph_break [] 2025-12-04T09:41:44.3776599Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3776772Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3776865Z Autotune Choices Stats: 2025-12-04T09:41:44.3777757Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3638", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3777875Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3777967Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3778090Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3778571Z triton_mm_3638 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3779050Z triton_mm_3639 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3779530Z triton_mm_3642 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3779998Z triton_mm_3634 0.0286 ms 96.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3780463Z triton_mm_3636 0.0286 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3780982Z triton_mm_3644 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3781453Z triton_mm_3631 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3781968Z triton_mm_3633 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3782434Z triton_mm_3637 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3782908Z triton_mm_3640 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3783245Z SingleProcess AUTOTUNE benchmarking takes 0.2024 seconds and 0.6077 seconds precompiling for 15 choices 2025-12-04T09:41:44.3783335Z Autotune Choices Stats: 2025-12-04T09:41:44.3784178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3667", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.027775999158620834, "best_triton_pos": 0} 2025-12-04T09:41:44.3784278Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3784361Z 
strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3784468Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3784941Z triton_mm_3667 0.0278 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3785458Z triton_mm_3664 0.0286 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3785933Z triton_mm_3662 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3786400Z triton_mm_3663 0.0287 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3786911Z triton_mm_3661 0.0297 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3787377Z triton_mm_3666 0.0307 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3787847Z triton_mm_3665 0.0328 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3788364Z triton_mm_3670 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3788840Z triton_mm_3671 0.0338 ms 82.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 
2025-12-04T09:41:44.3789311Z triton_mm_3668 0.0348 ms 79.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3789640Z SingleProcess AUTOTUNE benchmarking takes 0.1809 seconds and 1.8616 seconds precompiling for 13 choices 2025-12-04T09:41:44.3789816Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3789948Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3790078Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3790326Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3791699Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3791825Z graph_break [] 2025-12-04T09:41:44.3791927Z aten_mm_info [('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3792097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3792190Z Autotune Choices Stats: 2025-12-04T09:41:44.3793020Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3676", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3793115Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3793200Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3793304Z dtypes: torch.float16, torch.float16 
2025-12-04T09:41:44.3793780Z triton_mm_3676 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3794251Z triton_mm_3678 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3794775Z triton_mm_3682 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3795255Z triton_mm_3688 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3795722Z triton_mm_3674 0.0286 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3796262Z triton_mm_3675 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3796732Z triton_mm_3677 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3797205Z triton_mm_3679 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3797668Z triton_mm_3680 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3798146Z triton_mm_3681 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3798473Z SingleProcess AUTOTUNE benchmarking takes 0.2086 seconds and 0.6227 seconds precompiling for 15 choices 2025-12-04T09:41:44.3798562Z Autotune Choices Stats: 2025-12-04T09:41:44.3799442Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3705", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027744000777602196, "best_triton_pos": 0} 2025-12-04T09:41:44.3799595Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3799682Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3799784Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3800414Z triton_mm_3705 0.0277 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3800956Z triton_mm_3707 0.0286 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3801424Z triton_mm_3704 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3801895Z triton_mm_3706 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3802360Z triton_mm_3710 0.0287 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3802824Z triton_mm_3709 0.0307 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3803295Z triton_mm_3708 0.0328 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3803767Z triton_mm_3714 0.0347 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3804300Z triton_mm_3711 0.0348 ms 79.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3804781Z triton_mm_3713 0.0348 ms 79.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3805108Z SingleProcess AUTOTUNE benchmarking takes 0.1793 seconds and 1.8239 seconds precompiling for 13 choices 2025-12-04T09:41:44.3805342Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:41:44.3805435Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:41:44.3805571Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:41:44.3805815Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:41:44.3807191Z inductor [('triton_bundler_save_kernel', 168), ('async_compile_cache_miss', 19), ('benchmarking.InductorBenchmarker.benchmark_gpu', 18), ('select_algorithm_num_precompiles', 13), ('generated_module_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:41:44.3807273Z graph_break [] 2025-12-04T09:41:44.3807377Z aten_mm_info 
[('aten.mm_256_256_256', 1)] 2025-12-04T09:41:44.3807570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:41:44.3807677Z Autotune Choices Stats: 2025-12-04T09:41:44.3808542Z {"num_choices": 15, "num_triton_choices": 15, "best_kernel": "triton_mm_3731", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4", "best_time": 0.026623999699950218, "best_triton_pos": 0} 2025-12-04T09:41:44.3808636Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3808719Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3808879Z dtypes: torch.float16, torch.float16 2025-12-04T09:41:44.3809365Z triton_mm_3731 0.0266 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3809837Z triton_mm_3717 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3810345Z triton_mm_3720 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3810820Z triton_mm_3725 0.0276 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3811301Z triton_mm_3728 0.0277 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3811772Z triton_mm_3727 0.0278 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 
2025-12-04T09:41:44.3812242Z triton_mm_3719 0.0286 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3812715Z triton_mm_3726 0.0286 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3817557Z triton_mm_3718 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3818094Z triton_mm_3722 0.0287 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3818429Z SingleProcess AUTOTUNE benchmarking takes 0.2056 seconds and 0.6128 seconds precompiling for 15 choices 2025-12-04T09:41:44.3818520Z Autotune Choices Stats: 2025-12-04T09:41:44.3819436Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_mm_3748", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4", "best_time": 0.027648000046610832, "best_triton_pos": 0} 2025-12-04T09:41:44.3819538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T09:41:44.3819622Z strides: [256, 1], [256, 1] 2025-12-04T09:41:44.3819732Z dtypes: torch.float32, torch.float32 2025-12-04T09:41:44.3820218Z triton_mm_3748 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3820703Z triton_mm_3749 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3821173Z 
triton_mm_3750 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8 2025-12-04T09:41:44.3821642Z triton_mm_3753 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8 2025-12-04T09:41:44.3822111Z triton_mm_3747 0.0297 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2 2025-12-04T09:41:44.3822618Z triton_mm_3752 0.0316 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3823084Z triton_mm_3751 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4 2025-12-04T09:41:44.3823557Z triton_mm_3754 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3824072Z triton_mm_3756 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4 2025-12-04T09:41:44.3824552Z triton_mm_3757 0.0338 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8 2025-12-04T09:41:44.3824885Z SingleProcess AUTOTUNE benchmarking takes 0.1790 seconds and 1.8860 seconds precompiling for 13 choices 2025-12-04T09:41:44.3825496Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-c842e470cbb98a3c.xml - 
2025-12-04T09:41:44.3825639Z =========================== short test summary info ============================ 2025-12-04T09:41:44.3826203Z FAILED [7.1899s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3826293Z Searched string: 2025-12-04T09:41:44.3826426Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3826433Z 2025-12-04T09:41:44.3826550Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3826596Z 2025-12-04T09:41:44.3826727Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3826850Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3826854Z 2025-12-04T09:41:44.3826950Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3827041Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3827133Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3827235Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3827240Z 2025-12-04T09:41:44.3827341Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3827448Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3827549Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3827635Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3827639Z 2025-12-04T09:41:44.3827643Z 2025-12-04T09:41:44.3827845Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3827851Z 2025-12-04T09:41:44.3827855Z 2025-12-04T09:41:44.3827979Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3828098Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3828209Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3828294Z idx_m = rm[:, None] 2025-12-04T09:41:44.3828377Z idx_n = rn[None, :] 2025-12-04T09:41:44.3828470Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3828475Z 2025-12-04T09:41:44.3828570Z # inductor generates a suffix 2025-12-04T09:41:44.3828663Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3828873Z tl.store(out_ptr0 + 
(tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3828960Z ''', device_str='cuda') 2025-12-04T09:41:44.3828969Z 2025-12-04T09:41:44.3828973Z 2025-12-04T09:41:44.3829070Z async_compile.wait(globals()) 2025-12-04T09:41:44.3829151Z del async_compile 2025-12-04T09:41:44.3829156Z 2025-12-04T09:41:44.3829236Z class Runner: 2025-12-04T09:41:44.3829335Z def __init__(self, partitions): 2025-12-04T09:41:44.3829435Z self.partitions = partitions 2025-12-04T09:41:44.3829441Z 2025-12-04T09:41:44.3829551Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3829638Z new_callables = [] 2025-12-04T09:41:44.3829755Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3829904Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3830006Z self.partitions = new_callables 2025-12-04T09:41:44.3830011Z 2025-12-04T09:41:44.3830103Z def call(self, args): 2025-12-04T09:41:44.3830188Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3830266Z args.clear() 2025-12-04T09:41:44.3830396Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3830559Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3830665Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3830766Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3830931Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3831152Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3831251Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3831437Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3831522Z del arg0_1 2025-12-04T09:41:44.3831681Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3831932Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3832032Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3832248Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, 
stream=stream0) 2025-12-04T09:41:44.3832331Z del arg1_1 2025-12-04T09:41:44.3832409Z del buf0 2025-12-04T09:41:44.3832491Z return (buf1, ) 2025-12-04T09:41:44.3832496Z 2025-12-04T09:41:44.3832597Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3832676Z call = runner.call 2025-12-04T09:41:44.3832909Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3832914Z 2025-12-04T09:41:44.3832917Z 2025-12-04T09:41:44.3833056Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3833186Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3833336Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3833533Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3833734Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3833837Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3833999Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3834004Z 2025-12-04T09:41:44.3834049Z 2025-12-04T09:41:44.3834139Z if __name__ == "__main__": 2025-12-04T09:41:44.3834340Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3834500Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3834584Z From CHECK: .to( 2025-12-04T09:41:44.3834589Z 2025-12-04T09:41:44.3834593Z 2025-12-04T09:41:44.3834765Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3837508Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3837521Z 2025-12-04T09:41:44.3837778Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3838343Z FAILED [7.8446s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: 
Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3838434Z Searched string: 2025-12-04T09:41:44.3838568Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3838573Z 2025-12-04T09:41:44.3838687Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3838697Z 2025-12-04T09:41:44.3838827Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3838951Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3838955Z 2025-12-04T09:41:44.3839048Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3839190Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3839284Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3839399Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3839404Z 2025-12-04T09:41:44.3839586Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3839681Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3839821Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3839915Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3839920Z 2025-12-04T09:41:44.3839923Z 2025-12-04T09:41:44.3840088Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3840093Z 2025-12-04T09:41:44.3840096Z 2025-12-04T09:41:44.3840213Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3840333Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3840443Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3840527Z idx_m = rm[:, None] 2025-12-04T09:41:44.3840616Z idx_n = rn[None, :] 2025-12-04T09:41:44.3840713Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3840718Z 2025-12-04T09:41:44.3840813Z # inductor generates a suffix 2025-12-04T09:41:44.3840910Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3841117Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3841208Z ''', device_str='cuda') 2025-12-04T09:41:44.3841213Z 2025-12-04T09:41:44.3841217Z 2025-12-04T09:41:44.3841314Z async_compile.wait(globals()) 2025-12-04T09:41:44.3841397Z del 
async_compile 2025-12-04T09:41:44.3841401Z 2025-12-04T09:41:44.3841482Z class Runner: 2025-12-04T09:41:44.3841579Z def __init__(self, partitions): 2025-12-04T09:41:44.3841679Z self.partitions = partitions 2025-12-04T09:41:44.3841733Z 2025-12-04T09:41:44.3841845Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3841935Z new_callables = [] 2025-12-04T09:41:44.3842053Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3842160Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3842264Z self.partitions = new_callables 2025-12-04T09:41:44.3842269Z 2025-12-04T09:41:44.3842362Z def call(self, args): 2025-12-04T09:41:44.3842449Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3842528Z args.clear() 2025-12-04T09:41:44.3842659Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3842786Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3842895Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3842991Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3843154Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3843379Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3843478Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3843663Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3843751Z del arg0_1 2025-12-04T09:41:44.3843909Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3844246Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3844350Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3844568Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3844654Z del arg1_1 2025-12-04T09:41:44.3844733Z del buf0 2025-12-04T09:41:44.3844814Z return (buf1, ) 2025-12-04T09:41:44.3844819Z 2025-12-04T09:41:44.3844919Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3845000Z call = 
runner.call 2025-12-04T09:41:44.3845159Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3845163Z 2025-12-04T09:41:44.3845170Z 2025-12-04T09:41:44.3845308Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3845483Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3845632Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3845835Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3846034Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3846179Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3846342Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3846347Z 2025-12-04T09:41:44.3846351Z 2025-12-04T09:41:44.3846445Z if __name__ == "__main__": 2025-12-04T09:41:44.3846645Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3846809Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3846900Z From CHECK: .to( 2025-12-04T09:41:44.3846905Z 2025-12-04T09:41:44.3846909Z 2025-12-04T09:41:44.3847081Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3847630Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3847638Z 2025-12-04T09:41:44.3847867Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3848427Z FAILED [6.1224s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3848513Z Searched string: 2025-12-04T09:41:44.3848644Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3848648Z 2025-12-04T09:41:44.3848763Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 
2025-12-04T09:41:44.3848813Z 2025-12-04T09:41:44.3848941Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3849065Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3849069Z 2025-12-04T09:41:44.3849168Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3849255Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3849347Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3849444Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3849449Z 2025-12-04T09:41:44.3849534Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3849624Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3849719Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3849807Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3849811Z 2025-12-04T09:41:44.3849815Z 2025-12-04T09:41:44.3849973Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3849978Z 2025-12-04T09:41:44.3849981Z 2025-12-04T09:41:44.3850101Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3850218Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3850332Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3850416Z idx_m = rm[:, None] 2025-12-04T09:41:44.3850504Z idx_n = rn[None, :] 2025-12-04T09:41:44.3850598Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3850603Z 2025-12-04T09:41:44.3850758Z # inductor generates a suffix 2025-12-04T09:41:44.3850857Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3851072Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3851159Z ''', device_str='cuda') 2025-12-04T09:41:44.3851164Z 2025-12-04T09:41:44.3851174Z 2025-12-04T09:41:44.3851271Z async_compile.wait(globals()) 2025-12-04T09:41:44.3851352Z del async_compile 2025-12-04T09:41:44.3851357Z 2025-12-04T09:41:44.3851438Z class Runner: 2025-12-04T09:41:44.3851537Z def __init__(self, partitions): 2025-12-04T09:41:44.3851637Z self.partitions = partitions 2025-12-04T09:41:44.3851644Z 2025-12-04T09:41:44.3851756Z def 
recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3851845Z new_callables = [] 2025-12-04T09:41:44.3851959Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3852106Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3852209Z self.partitions = new_callables 2025-12-04T09:41:44.3852214Z 2025-12-04T09:41:44.3852307Z def call(self, args): 2025-12-04T09:41:44.3852394Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3852473Z args.clear() 2025-12-04T09:41:44.3852599Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3852768Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3852873Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3852972Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3853134Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3853354Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3853452Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3853640Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3853728Z del arg0_1 2025-12-04T09:41:44.3853887Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3854143Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3854243Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3854457Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3854538Z del arg1_1 2025-12-04T09:41:44.3854619Z del buf0 2025-12-04T09:41:44.3854701Z return (buf1, ) 2025-12-04T09:41:44.3854705Z 2025-12-04T09:41:44.3854806Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3854933Z call = runner.call 2025-12-04T09:41:44.3855086Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3855091Z 2025-12-04T09:41:44.3855095Z 2025-12-04T09:41:44.3855233Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3855364Z from 
torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3855509Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3855710Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3855910Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3856017Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3856178Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3856182Z 2025-12-04T09:41:44.3856186Z 2025-12-04T09:41:44.3856274Z if __name__ == "__main__": 2025-12-04T09:41:44.3856481Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3856640Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3856720Z From CHECK: .to( 2025-12-04T09:41:44.3856729Z 2025-12-04T09:41:44.3856733Z 2025-12-04T09:41:44.3856905Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3857553Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3857558Z 2025-12-04T09:41:44.3857779Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3858308Z FAILED [8.1348s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3858393Z Searched string: 2025-12-04T09:41:44.3858524Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3858529Z 2025-12-04T09:41:44.3858642Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3858649Z 2025-12-04T09:41:44.3858778Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3858902Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3858905Z 2025-12-04T09:41:44.3859067Z idx_m = offs_a_m[:, 
None] 2025-12-04T09:41:44.3859159Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3859251Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3859348Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3859353Z 2025-12-04T09:41:44.3859439Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3859527Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3859659Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3859747Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3859751Z 2025-12-04T09:41:44.3859755Z 2025-12-04T09:41:44.3859915Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3859919Z 2025-12-04T09:41:44.3859923Z 2025-12-04T09:41:44.3860042Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3860155Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3860273Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3860359Z idx_m = rm[:, None] 2025-12-04T09:41:44.3860444Z idx_n = rn[None, :] 2025-12-04T09:41:44.3860538Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3860542Z 2025-12-04T09:41:44.3860644Z # inductor generates a suffix 2025-12-04T09:41:44.3860738Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3860946Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3861035Z ''', device_str='cuda') 2025-12-04T09:41:44.3861039Z 2025-12-04T09:41:44.3861042Z 2025-12-04T09:41:44.3861144Z async_compile.wait(globals()) 2025-12-04T09:41:44.3861224Z del async_compile 2025-12-04T09:41:44.3861228Z 2025-12-04T09:41:44.3861305Z class Runner: 2025-12-04T09:41:44.3861405Z def __init__(self, partitions): 2025-12-04T09:41:44.3861550Z self.partitions = partitions 2025-12-04T09:41:44.3861554Z 2025-12-04T09:41:44.3861666Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3861754Z new_callables = [] 2025-12-04T09:41:44.3861869Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3861976Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3862077Z self.partitions = new_callables 
2025-12-04T09:41:44.3862082Z 2025-12-04T09:41:44.3862172Z def call(self, args): 2025-12-04T09:41:44.3862261Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3862341Z args.clear() 2025-12-04T09:41:44.3862469Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3862596Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3862701Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3862799Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3862960Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3863178Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3863278Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3863466Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3863550Z del arg0_1 2025-12-04T09:41:44.3863712Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3864012Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3864115Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3864332Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3864414Z del arg1_1 2025-12-04T09:41:44.3864495Z del buf0 2025-12-04T09:41:44.3864579Z return (buf1, ) 2025-12-04T09:41:44.3864583Z 2025-12-04T09:41:44.3864682Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3864772Z call = runner.call 2025-12-04T09:41:44.3864926Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3864931Z 2025-12-04T09:41:44.3864934Z 2025-12-04T09:41:44.3865075Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3865243Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3865389Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3865591Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3865793Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3865930Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3866094Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3866098Z 2025-12-04T09:41:44.3866102Z 2025-12-04T09:41:44.3866188Z if __name__ == "__main__": 2025-12-04T09:41:44.3866388Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3866550Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3866631Z From CHECK: .to( 2025-12-04T09:41:44.3866635Z 2025-12-04T09:41:44.3866639Z 2025-12-04T09:41:44.3866814Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3867360Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3867365Z 2025-12-04T09:41:44.3867585Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3868161Z FAILED [6.0937s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3868244Z Searched string: 2025-12-04T09:41:44.3868383Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3868387Z 2025-12-04T09:41:44.3868501Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3868547Z 2025-12-04T09:41:44.3868680Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3868804Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3868808Z 2025-12-04T09:41:44.3868902Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3868994Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3869085Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3869176Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3869183Z 2025-12-04T09:41:44.3869268Z idx_m = b_k_idx_vals 
2025-12-04T09:41:44.3869357Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3869450Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3869538Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3869542Z 2025-12-04T09:41:44.3869546Z 2025-12-04T09:41:44.3869700Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3869705Z 2025-12-04T09:41:44.3869708Z 2025-12-04T09:41:44.3869835Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3869948Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3870062Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3870146Z idx_m = rm[:, None] 2025-12-04T09:41:44.3870232Z idx_n = rn[None, :] 2025-12-04T09:41:44.3870329Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3870334Z 2025-12-04T09:41:44.3870477Z # inductor generates a suffix 2025-12-04T09:41:44.3870571Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3870786Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3870873Z ''', device_str='cuda') 2025-12-04T09:41:44.3870877Z 2025-12-04T09:41:44.3870881Z 2025-12-04T09:41:44.3870984Z async_compile.wait(globals()) 2025-12-04T09:41:44.3871064Z del async_compile 2025-12-04T09:41:44.3871069Z 2025-12-04T09:41:44.3871146Z class Runner: 2025-12-04T09:41:44.3871249Z def __init__(self, partitions): 2025-12-04T09:41:44.3871354Z self.partitions = partitions 2025-12-04T09:41:44.3871358Z 2025-12-04T09:41:44.3871465Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3871555Z new_callables = [] 2025-12-04T09:41:44.3871669Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3871820Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3871923Z self.partitions = new_callables 2025-12-04T09:41:44.3871927Z 2025-12-04T09:41:44.3872017Z def call(self, args): 2025-12-04T09:41:44.3872109Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3872188Z args.clear() 2025-12-04T09:41:44.3872355Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 
2025-12-04T09:41:44.3872481Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3872585Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3872681Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3872847Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3873066Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3873165Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3873353Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3873434Z del arg0_1 2025-12-04T09:41:44.3873598Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3873852Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3873948Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3874169Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3874249Z del arg1_1 2025-12-04T09:41:44.3874329Z del buf0 2025-12-04T09:41:44.3874411Z return (buf1, ) 2025-12-04T09:41:44.3874415Z 2025-12-04T09:41:44.3874514Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3874645Z call = runner.call 2025-12-04T09:41:44.3874800Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3874804Z 2025-12-04T09:41:44.3874808Z 2025-12-04T09:41:44.3874951Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3875083Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3875228Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3875429Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3875628Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3875733Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3875895Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3875899Z 
2025-12-04T09:41:44.3875903Z 2025-12-04T09:41:44.3875990Z if __name__ == "__main__": 2025-12-04T09:41:44.3876191Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3876351Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3876436Z From CHECK: .to( 2025-12-04T09:41:44.3876445Z 2025-12-04T09:41:44.3876448Z 2025-12-04T09:41:44.3876624Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3877209Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3877214Z 2025-12-04T09:41:44.3877433Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3877990Z FAILED [5.6103s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3878083Z Searched string: 2025-12-04T09:41:44.3878231Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3878236Z 2025-12-04T09:41:44.3878349Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3878355Z 2025-12-04T09:41:44.3878485Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3878609Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3878613Z 2025-12-04T09:41:44.3878752Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3878842Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3878933Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3879030Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3879034Z 2025-12-04T09:41:44.3879120Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3879249Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3879346Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3879434Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3879438Z 2025-12-04T09:41:44.3879442Z 2025-12-04T09:41:44.3879663Z acc += tl.dot(a, b, 
allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3879667Z 2025-12-04T09:41:44.3879671Z 2025-12-04T09:41:44.3879791Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3879904Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3880018Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3880101Z idx_m = rm[:, None] 2025-12-04T09:41:44.3880191Z idx_n = rn[None, :] 2025-12-04T09:41:44.3880288Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3880292Z 2025-12-04T09:41:44.3880390Z # inductor generates a suffix 2025-12-04T09:41:44.3880481Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3880689Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3880775Z ''', device_str='cuda') 2025-12-04T09:41:44.3880779Z 2025-12-04T09:41:44.3880783Z 2025-12-04T09:41:44.3880883Z async_compile.wait(globals()) 2025-12-04T09:41:44.3880963Z del async_compile 2025-12-04T09:41:44.3880967Z 2025-12-04T09:41:44.3881044Z class Runner: 2025-12-04T09:41:44.3881146Z def __init__(self, partitions): 2025-12-04T09:41:44.3881293Z self.partitions = partitions 2025-12-04T09:41:44.3881298Z 2025-12-04T09:41:44.3881409Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3881497Z new_callables = [] 2025-12-04T09:41:44.3881611Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3881718Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3881820Z self.partitions = new_callables 2025-12-04T09:41:44.3881826Z 2025-12-04T09:41:44.3881914Z def call(self, args): 2025-12-04T09:41:44.3882004Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3882083Z args.clear() 2025-12-04T09:41:44.3882210Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3882335Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3882441Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3882539Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3882703Z buf0 = empty_strided_cuda((256, 256), (256, 1), 
torch.float32) 2025-12-04T09:41:44.3882921Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3883023Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3883210Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3883292Z del arg0_1 2025-12-04T09:41:44.3883530Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3883787Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3883886Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3884111Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.3884191Z del arg1_1 2025-12-04T09:41:44.3884272Z del buf0 2025-12-04T09:41:44.3884355Z return (buf1, ) 2025-12-04T09:41:44.3884359Z 2025-12-04T09:41:44.3884458Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3884546Z call = runner.call 2025-12-04T09:41:44.3884699Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3884703Z 2025-12-04T09:41:44.3884707Z 2025-12-04T09:41:44.3884847Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3885019Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3885165Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3885368Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3885569Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3885714Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3885878Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3885883Z 2025-12-04T09:41:44.3885887Z 2025-12-04T09:41:44.3885973Z if __name__ == "__main__": 2025-12-04T09:41:44.3886173Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3886334Z compiled_module_main('None', benchmark_compiled_module) 
2025-12-04T09:41:44.3886416Z From CHECK: .to( 2025-12-04T09:41:44.3886420Z 2025-12-04T09:41:44.3886424Z 2025-12-04T09:41:44.3886604Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3887174Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3887180Z 2025-12-04T09:41:44.3887424Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3887954Z FAILED [7.7698s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3888038Z Searched string: 2025-12-04T09:41:44.3888173Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3888177Z 2025-12-04T09:41:44.3888332Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3888336Z 2025-12-04T09:41:44.3888464Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3888587Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3888591Z 2025-12-04T09:41:44.3888685Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3888777Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3888870Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3888961Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3888969Z 2025-12-04T09:41:44.3889056Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3889148Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3889241Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3889327Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3889332Z 2025-12-04T09:41:44.3889335Z 2025-12-04T09:41:44.3889490Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3889494Z 2025-12-04T09:41:44.3889500Z 2025-12-04T09:41:44.3889621Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3889735Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 
2025-12-04T09:41:44.3889856Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3889940Z idx_m = rm[:, None] 2025-12-04T09:41:44.3890024Z idx_n = rn[None, :] 2025-12-04T09:41:44.3890118Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3890122Z 2025-12-04T09:41:44.3890266Z # inductor generates a suffix 2025-12-04T09:41:44.3890360Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3890577Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3890665Z ''', device_str='cuda') 2025-12-04T09:41:44.3890669Z 2025-12-04T09:41:44.3890673Z 2025-12-04T09:41:44.3890772Z async_compile.wait(globals()) 2025-12-04T09:41:44.3890852Z del async_compile 2025-12-04T09:41:44.3890856Z 2025-12-04T09:41:44.3890933Z class Runner: 2025-12-04T09:41:44.3891036Z def __init__(self, partitions): 2025-12-04T09:41:44.3891139Z self.partitions = partitions 2025-12-04T09:41:44.3891144Z 2025-12-04T09:41:44.3891252Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3891342Z new_callables = [] 2025-12-04T09:41:44.3891500Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3891606Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3891709Z self.partitions = new_callables 2025-12-04T09:41:44.3891715Z 2025-12-04T09:41:44.3891803Z def call(self, args): 2025-12-04T09:41:44.3891894Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3891977Z args.clear() 2025-12-04T09:41:44.3892144Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3892274Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3892378Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3892474Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3892637Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3892856Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3892956Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3893146Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 
65536, stream=stream0) 2025-12-04T09:41:44.3893226Z del arg0_1 2025-12-04T09:41:44.3893389Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3893638Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3893736Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3893955Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3894034Z del arg1_1 2025-12-04T09:41:44.3894114Z del buf0 2025-12-04T09:41:44.3894196Z return (buf1, ) 2025-12-04T09:41:44.3894200Z 2025-12-04T09:41:44.3894298Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3894430Z call = runner.call 2025-12-04T09:41:44.3894583Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3894588Z 2025-12-04T09:41:44.3894592Z 2025-12-04T09:41:44.3894728Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3894862Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3895006Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3895209Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3895407Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3895508Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3895672Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3895676Z 2025-12-04T09:41:44.3895680Z 2025-12-04T09:41:44.3895766Z if __name__ == "__main__": 2025-12-04T09:41:44.3895965Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3896131Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3896213Z From CHECK: .to( 2025-12-04T09:41:44.3896217Z 2025-12-04T09:41:44.3896221Z 2025-12-04T09:41:44.3896397Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3896984Z 
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3896990Z 2025-12-04T09:41:44.3897204Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3897768Z FAILED [6.1830s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3897865Z Searched string: 2025-12-04T09:41:44.3898012Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3898016Z 2025-12-04T09:41:44.3898133Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3898137Z 2025-12-04T09:41:44.3898261Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3898388Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3898393Z 2025-12-04T09:41:44.3898525Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3898618Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3898711Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3898800Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3898805Z 2025-12-04T09:41:44.3898893Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3899021Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3899110Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3899201Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3899205Z 2025-12-04T09:41:44.3899209Z 2025-12-04T09:41:44.3899363Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3899368Z 2025-12-04T09:41:44.3899374Z 2025-12-04T09:41:44.3899496Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3899610Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3899721Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3899812Z idx_m = rm[:, None] 2025-12-04T09:41:44.3899893Z idx_n = rn[None, :] 2025-12-04T09:41:44.3899984Z mask = (idx_m < M) & (idx_n < N) 
2025-12-04T09:41:44.3899991Z 2025-12-04T09:41:44.3900088Z # inductor generates a suffix 2025-12-04T09:41:44.3900177Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3900590Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3900680Z ''', device_str='cuda') 2025-12-04T09:41:44.3900684Z 2025-12-04T09:41:44.3900688Z 2025-12-04T09:41:44.3900786Z async_compile.wait(globals()) 2025-12-04T09:41:44.3900870Z del async_compile 2025-12-04T09:41:44.3900874Z 2025-12-04T09:41:44.3900951Z class Runner: 2025-12-04T09:41:44.3901055Z def __init__(self, partitions): 2025-12-04T09:41:44.3901229Z self.partitions = partitions 2025-12-04T09:41:44.3901233Z 2025-12-04T09:41:44.3901340Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3901433Z new_callables = [] 2025-12-04T09:41:44.3901550Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3901652Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3901755Z self.partitions = new_callables 2025-12-04T09:41:44.3901761Z 2025-12-04T09:41:44.3901849Z def call(self, args): 2025-12-04T09:41:44.3901935Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3902018Z args.clear() 2025-12-04T09:41:44.3902145Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3902271Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3902374Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3902468Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3902634Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3902854Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3902950Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3903141Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3903222Z del arg0_1 2025-12-04T09:41:44.3903454Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3903707Z # Topologically Sorted Source Nodes: [to, mm], 
Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3903806Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3904024Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3904104Z del arg1_1 2025-12-04T09:41:44.3904181Z del buf0 2025-12-04T09:41:44.3904266Z return (buf1, ) 2025-12-04T09:41:44.3904270Z 2025-12-04T09:41:44.3904368Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3904453Z call = runner.call 2025-12-04T09:41:44.3904610Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3904614Z 2025-12-04T09:41:44.3904618Z 2025-12-04T09:41:44.3904811Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3904954Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3905107Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3905311Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3905523Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3905678Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3905843Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3905848Z 2025-12-04T09:41:44.3905852Z 2025-12-04T09:41:44.3905941Z if __name__ == "__main__": 2025-12-04T09:41:44.3906139Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3906305Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3906386Z From CHECK: .to( 2025-12-04T09:41:44.3906391Z 2025-12-04T09:41:44.3906395Z 2025-12-04T09:41:44.3906572Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3907116Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3907120Z 2025-12-04T09:41:44.3907334Z This message can be suppressed by 
setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3907868Z FAILED [5.9019s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3907968Z Searched string: 2025-12-04T09:41:44.3908116Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3908127Z 2025-12-04T09:41:44.3908323Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3908327Z 2025-12-04T09:41:44.3908453Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3908581Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3908588Z 2025-12-04T09:41:44.3908680Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3908767Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3908863Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3908956Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3908960Z 2025-12-04T09:41:44.3909055Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3909146Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3909235Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3909326Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3909330Z 2025-12-04T09:41:44.3909334Z 2025-12-04T09:41:44.3909487Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3909491Z 2025-12-04T09:41:44.3909498Z 2025-12-04T09:41:44.3909620Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3909734Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3909846Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3909936Z idx_m = rm[:, None] 2025-12-04T09:41:44.3910019Z idx_n = rn[None, :] 2025-12-04T09:41:44.3910112Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3910164Z 2025-12-04T09:41:44.3910270Z # inductor generates a suffix 2025-12-04T09:41:44.3910362Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3910575Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 
2025-12-04T09:41:44.3910664Z ''', device_str='cuda') 2025-12-04T09:41:44.3910669Z 2025-12-04T09:41:44.3910672Z 2025-12-04T09:41:44.3910771Z async_compile.wait(globals()) 2025-12-04T09:41:44.3910855Z del async_compile 2025-12-04T09:41:44.3910859Z 2025-12-04T09:41:44.3910938Z class Runner: 2025-12-04T09:41:44.3911037Z def __init__(self, partitions): 2025-12-04T09:41:44.3911144Z self.partitions = partitions 2025-12-04T09:41:44.3911148Z 2025-12-04T09:41:44.3911257Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3911347Z new_callables = [] 2025-12-04T09:41:44.3911503Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3911607Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3911713Z self.partitions = new_callables 2025-12-04T09:41:44.3911718Z 2025-12-04T09:41:44.3911804Z def call(self, args): 2025-12-04T09:41:44.3911891Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3912013Z args.clear() 2025-12-04T09:41:44.3912138Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3912261Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3912367Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3912461Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3912627Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3912844Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3912941Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3913134Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3913216Z del arg0_1 2025-12-04T09:41:44.3913376Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3913629Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3913732Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3913950Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.3914032Z del arg1_1 
2025-12-04T09:41:44.3914110Z del buf0 2025-12-04T09:41:44.3914194Z return (buf1, ) 2025-12-04T09:41:44.3914198Z 2025-12-04T09:41:44.3914340Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3914422Z call = runner.call 2025-12-04T09:41:44.3914578Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3914583Z 2025-12-04T09:41:44.3914587Z 2025-12-04T09:41:44.3914725Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3914859Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3915005Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3915202Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3915404Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3915506Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3915667Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3915671Z 2025-12-04T09:41:44.3915675Z 2025-12-04T09:41:44.3915764Z if __name__ == "__main__": 2025-12-04T09:41:44.3915964Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3916127Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3916207Z From CHECK: .to( 2025-12-04T09:41:44.3916211Z 2025-12-04T09:41:44.3916215Z 2025-12-04T09:41:44.3916389Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3916987Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3916992Z 2025-12-04T09:41:44.3917208Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3917750Z FAILED [6.5304s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 
2025-12-04T09:41:44.3917832Z Searched string: 2025-12-04T09:41:44.3917963Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3917970Z 2025-12-04T09:41:44.3918088Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3918093Z 2025-12-04T09:41:44.3918217Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3918346Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3918391Z 2025-12-04T09:41:44.3918489Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3918577Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3918674Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3918763Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3918767Z 2025-12-04T09:41:44.3918853Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3918983Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3919072Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3919159Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3919166Z 2025-12-04T09:41:44.3919170Z 2025-12-04T09:41:44.3919325Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3919333Z 2025-12-04T09:41:44.3919337Z 2025-12-04T09:41:44.3919454Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3919631Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3919743Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3919830Z idx_m = rm[:, None] 2025-12-04T09:41:44.3919920Z idx_n = rn[None, :] 2025-12-04T09:41:44.3920013Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3920020Z 2025-12-04T09:41:44.3920118Z # inductor generates a suffix 2025-12-04T09:41:44.3920206Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3920413Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3920503Z ''', device_str='cuda') 2025-12-04T09:41:44.3920507Z 2025-12-04T09:41:44.3920511Z 2025-12-04T09:41:44.3920607Z async_compile.wait(globals()) 2025-12-04T09:41:44.3920689Z del async_compile 2025-12-04T09:41:44.3920696Z 
2025-12-04T09:41:44.3920773Z class Runner: 2025-12-04T09:41:44.3920917Z def __init__(self, partitions): 2025-12-04T09:41:44.3921026Z self.partitions = partitions 2025-12-04T09:41:44.3921031Z 2025-12-04T09:41:44.3921138Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3921225Z new_callables = [] 2025-12-04T09:41:44.3921346Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3921451Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3921554Z self.partitions = new_callables 2025-12-04T09:41:44.3921558Z 2025-12-04T09:41:44.3921647Z def call(self, args): 2025-12-04T09:41:44.3921734Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3921819Z args.clear() 2025-12-04T09:41:44.3921942Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3922065Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3922173Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3922268Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3922432Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3922652Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3922748Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3922938Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3923018Z del arg0_1 2025-12-04T09:41:44.3923227Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3923479Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3923578Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3923794Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3923879Z del arg1_1 2025-12-04T09:41:44.3923959Z del buf0 2025-12-04T09:41:44.3924041Z return (buf1, ) 2025-12-04T09:41:44.3924046Z 2025-12-04T09:41:44.3924148Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3924229Z call = runner.call 2025-12-04T09:41:44.3924384Z 
recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3924388Z 2025-12-04T09:41:44.3924391Z 2025-12-04T09:41:44.3924570Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3924701Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3924854Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3925052Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3925250Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3925395Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3925557Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3925561Z 2025-12-04T09:41:44.3925565Z 2025-12-04T09:41:44.3925653Z if __name__ == "__main__": 2025-12-04T09:41:44.3925851Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3926012Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3926098Z From CHECK: .to( 2025-12-04T09:41:44.3926102Z 2025-12-04T09:41:44.3926105Z 2025-12-04T09:41:44.3926278Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3926824Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3926828Z 2025-12-04T09:41:44.3927041Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3927595Z FAILED [6.4033s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3927693Z Searched string: 2025-12-04T09:41:44.3927846Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3927893Z 2025-12-04T09:41:44.3928012Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3928016Z 
2025-12-04T09:41:44.3928139Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3928264Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3928270Z 2025-12-04T09:41:44.3928365Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3932974Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3933100Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3933197Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3933202Z 2025-12-04T09:41:44.3933293Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3933391Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3933482Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3933575Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3933579Z 2025-12-04T09:41:44.3933583Z 2025-12-04T09:41:44.3933746Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3933753Z 2025-12-04T09:41:44.3933757Z 2025-12-04T09:41:44.3933877Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3933996Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3934109Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3934202Z idx_m = rm[:, None] 2025-12-04T09:41:44.3934289Z idx_n = rn[None, :] 2025-12-04T09:41:44.3934385Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3934461Z 2025-12-04T09:41:44.3934571Z # inductor generates a suffix 2025-12-04T09:41:44.3934662Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3934877Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3934970Z ''', device_str='cuda') 2025-12-04T09:41:44.3934974Z 2025-12-04T09:41:44.3934978Z 2025-12-04T09:41:44.3935077Z async_compile.wait(globals()) 2025-12-04T09:41:44.3935165Z del async_compile 2025-12-04T09:41:44.3935170Z 2025-12-04T09:41:44.3935249Z class Runner: 2025-12-04T09:41:44.3935355Z def __init__(self, partitions): 2025-12-04T09:41:44.3935461Z self.partitions = partitions 2025-12-04T09:41:44.3935465Z 2025-12-04T09:41:44.3935574Z def recursively_apply_fns(self, fns): 
2025-12-04T09:41:44.3935662Z new_callables = [] 2025-12-04T09:41:44.3935832Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3935940Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3936049Z self.partitions = new_callables 2025-12-04T09:41:44.3936060Z 2025-12-04T09:41:44.3936150Z def call(self, args): 2025-12-04T09:41:44.3936239Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3936400Z args.clear() 2025-12-04T09:41:44.3936529Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3936656Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3936766Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3936865Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3937033Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3937262Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3937359Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3937563Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3937661Z del arg0_1 2025-12-04T09:41:44.3937852Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3938111Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3938214Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3938433Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3938518Z del arg1_1 2025-12-04T09:41:44.3938597Z del buf0 2025-12-04T09:41:44.3938683Z return (buf1, ) 2025-12-04T09:41:44.3938690Z 2025-12-04T09:41:44.3938836Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3938922Z call = runner.call 2025-12-04T09:41:44.3939086Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3939090Z 2025-12-04T09:41:44.3939094Z 2025-12-04T09:41:44.3939235Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3939367Z from torch._dynamo.testing import 
rand_strided 2025-12-04T09:41:44.3939524Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3939724Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3939929Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3940031Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3940193Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3940198Z 2025-12-04T09:41:44.3940202Z 2025-12-04T09:41:44.3940296Z if __name__ == "__main__": 2025-12-04T09:41:44.3940497Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3940658Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3940743Z From CHECK: .to( 2025-12-04T09:41:44.3940748Z 2025-12-04T09:41:44.3940752Z 2025-12-04T09:41:44.3940927Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3941541Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3941546Z 2025-12-04T09:41:44.3941765Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3942298Z FAILED [7.7990s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3942385Z Searched string: 2025-12-04T09:41:44.3942519Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3942526Z 2025-12-04T09:41:44.3942646Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3942650Z 2025-12-04T09:41:44.3942779Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3942948Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3942953Z 2025-12-04T09:41:44.3943056Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:44.3943145Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3943246Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3943339Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3943344Z 2025-12-04T09:41:44.3943480Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3943573Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3943664Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3943756Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3943760Z 2025-12-04T09:41:44.3943764Z 2025-12-04T09:41:44.3943926Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3943934Z 2025-12-04T09:41:44.3943938Z 2025-12-04T09:41:44.3944058Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3944178Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3944292Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3944381Z idx_m = rm[:, None] 2025-12-04T09:41:44.3944470Z idx_n = rn[None, :] 2025-12-04T09:41:44.3944569Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3944574Z 2025-12-04T09:41:44.3944669Z # inductor generates a suffix 2025-12-04T09:41:44.3944765Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3944978Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3945071Z ''', device_str='cuda') 2025-12-04T09:41:44.3945075Z 2025-12-04T09:41:44.3945079Z 2025-12-04T09:41:44.3945177Z async_compile.wait(globals()) 2025-12-04T09:41:44.3945264Z del async_compile 2025-12-04T09:41:44.3945268Z 2025-12-04T09:41:44.3945349Z class Runner: 2025-12-04T09:41:44.3945494Z def __init__(self, partitions): 2025-12-04T09:41:44.3945598Z self.partitions = partitions 2025-12-04T09:41:44.3945606Z 2025-12-04T09:41:44.3945716Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3945806Z new_callables = [] 2025-12-04T09:41:44.3945929Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3946034Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3946140Z self.partitions = new_callables 
2025-12-04T09:41:44.3946144Z 2025-12-04T09:41:44.3946237Z def call(self, args): 2025-12-04T09:41:44.3946326Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3946414Z args.clear() 2025-12-04T09:41:44.3946547Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3946672Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3946783Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3946881Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3947050Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3947271Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3947369Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3947558Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3947643Z del arg0_1 2025-12-04T09:41:44.3947857Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3948112Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3948217Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3948436Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3948520Z del arg1_1 2025-12-04T09:41:44.3948600Z del buf0 2025-12-04T09:41:44.3948686Z return (buf1, ) 2025-12-04T09:41:44.3948693Z 2025-12-04T09:41:44.3948798Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3948884Z call = runner.call 2025-12-04T09:41:44.3949041Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3949045Z 2025-12-04T09:41:44.3949052Z 2025-12-04T09:41:44.3949232Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3949367Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3949522Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3949721Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3949960Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3950066Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3950232Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3950237Z 2025-12-04T09:41:44.3950240Z 2025-12-04T09:41:44.3950331Z if __name__ == "__main__": 2025-12-04T09:41:44.3950534Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3950692Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3950779Z From CHECK: .to( 2025-12-04T09:41:44.3950783Z 2025-12-04T09:41:44.3950787Z 2025-12-04T09:41:44.3950962Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3951507Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3951512Z 2025-12-04T09:41:44.3951727Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3952263Z FAILED [6.3970s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3952350Z Searched string: 2025-12-04T09:41:44.3952484Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3952532Z 2025-12-04T09:41:44.3952653Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3952657Z 2025-12-04T09:41:44.3952785Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3952914Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3952918Z 2025-12-04T09:41:44.3953016Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3953109Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3953205Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3953302Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3953306Z 2025-12-04T09:41:44.3953399Z idx_m = b_k_idx_vals 
2025-12-04T09:41:44.3953494Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3953586Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3953676Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3953680Z 2025-12-04T09:41:44.3953684Z 2025-12-04T09:41:44.3953843Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3953850Z 2025-12-04T09:41:44.3953854Z 2025-12-04T09:41:44.3953975Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3954093Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3954209Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3954295Z idx_m = rm[:, None] 2025-12-04T09:41:44.3954383Z idx_n = rn[None, :] 2025-12-04T09:41:44.3954541Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3954546Z 2025-12-04T09:41:44.3954645Z # inductor generates a suffix 2025-12-04T09:41:44.3954740Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3954951Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3955038Z ''', device_str='cuda') 2025-12-04T09:41:44.3955042Z 2025-12-04T09:41:44.3955054Z 2025-12-04T09:41:44.3955154Z async_compile.wait(globals()) 2025-12-04T09:41:44.3955237Z del async_compile 2025-12-04T09:41:44.3955241Z 2025-12-04T09:41:44.3955325Z class Runner: 2025-12-04T09:41:44.3955429Z def __init__(self, partitions): 2025-12-04T09:41:44.3955531Z self.partitions = partitions 2025-12-04T09:41:44.3955535Z 2025-12-04T09:41:44.3955646Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3955737Z new_callables = [] 2025-12-04T09:41:44.3955896Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3956006Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3956113Z self.partitions = new_callables 2025-12-04T09:41:44.3956117Z 2025-12-04T09:41:44.3956210Z def call(self, args): 2025-12-04T09:41:44.3956300Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3956423Z args.clear() 2025-12-04T09:41:44.3956555Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 
2025-12-04T09:41:44.3956683Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3956790Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3956890Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3957058Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3957275Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3957378Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3957573Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3957677Z del arg0_1 2025-12-04T09:41:44.3957865Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3958125Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3958228Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3958443Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3958527Z del arg1_1 2025-12-04T09:41:44.3958613Z del buf0 2025-12-04T09:41:44.3958695Z return (buf1, ) 2025-12-04T09:41:44.3958743Z 2025-12-04T09:41:44.3958850Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3958936Z call = runner.call 2025-12-04T09:41:44.3959092Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3959096Z 2025-12-04T09:41:44.3959100Z 2025-12-04T09:41:44.3959246Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3959377Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3959592Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3959795Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3959997Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3960101Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3960264Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3960269Z 
2025-12-04T09:41:44.3960273Z 2025-12-04T09:41:44.3960361Z if __name__ == "__main__": 2025-12-04T09:41:44.3960567Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3960725Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3960813Z From CHECK: .to( 2025-12-04T09:41:44.3960817Z 2025-12-04T09:41:44.3960821Z 2025-12-04T09:41:44.3960997Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3961595Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3961600Z 2025-12-04T09:41:44.3961824Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3962351Z FAILED [6.0633s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3962441Z Searched string: 2025-12-04T09:41:44.3962573Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3962580Z 2025-12-04T09:41:44.3962699Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3962703Z 2025-12-04T09:41:44.3962835Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3963027Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3963032Z 2025-12-04T09:41:44.3963129Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3963221Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3963315Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3963414Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3963419Z 2025-12-04T09:41:44.3963557Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3963649Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3963742Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3963834Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3963839Z 2025-12-04T09:41:44.3963843Z 2025-12-04T09:41:44.3964004Z acc += tl.dot(a, b, 
allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3964011Z 2025-12-04T09:41:44.3964015Z 2025-12-04T09:41:44.3964133Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3964248Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3964367Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3964453Z idx_m = rm[:, None] 2025-12-04T09:41:44.3964539Z idx_n = rn[None, :] 2025-12-04T09:41:44.3964640Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3964644Z 2025-12-04T09:41:44.3964743Z # inductor generates a suffix 2025-12-04T09:41:44.3964839Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3965050Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3965136Z ''', device_str='cuda') 2025-12-04T09:41:44.3965141Z 2025-12-04T09:41:44.3965144Z 2025-12-04T09:41:44.3965245Z async_compile.wait(globals()) 2025-12-04T09:41:44.3965329Z del async_compile 2025-12-04T09:41:44.3965334Z 2025-12-04T09:41:44.3965461Z class Runner: 2025-12-04T09:41:44.3965565Z def __init__(self, partitions): 2025-12-04T09:41:44.3965669Z self.partitions = partitions 2025-12-04T09:41:44.3965673Z 2025-12-04T09:41:44.3965786Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3965881Z new_callables = [] 2025-12-04T09:41:44.3965998Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3966106Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3966213Z self.partitions = new_callables 2025-12-04T09:41:44.3966217Z 2025-12-04T09:41:44.3966308Z def call(self, args): 2025-12-04T09:41:44.3966400Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3966486Z args.clear() 2025-12-04T09:41:44.3966617Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3966742Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3966848Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3966948Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3967115Z buf0 = empty_strided_cuda((256, 256), (256, 1), 
torch.float32) 2025-12-04T09:41:44.3967331Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3967433Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3967631Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3967733Z del arg0_1 2025-12-04T09:41:44.3967968Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3968221Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3968328Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3968546Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3968628Z del arg1_1 2025-12-04T09:41:44.3968711Z del buf0 2025-12-04T09:41:44.3968798Z return (buf1, ) 2025-12-04T09:41:44.3968805Z 2025-12-04T09:41:44.3968905Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3968990Z call = runner.call 2025-12-04T09:41:44.3969147Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3969152Z 2025-12-04T09:41:44.3969155Z 2025-12-04T09:41:44.3969342Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3969475Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3969624Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3969831Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3970073Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3970173Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3970338Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3970343Z 2025-12-04T09:41:44.3970347Z 2025-12-04T09:41:44.3970438Z if __name__ == "__main__": 2025-12-04T09:41:44.3970643Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3970801Z compiled_module_main('None', benchmark_compiled_module) 
2025-12-04T09:41:44.3970884Z From CHECK: .to( 2025-12-04T09:41:44.3970888Z 2025-12-04T09:41:44.3970891Z 2025-12-04T09:41:44.3971071Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3971614Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3971619Z 2025-12-04T09:41:44.3971840Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3972365Z FAILED [6.2010s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3972452Z Searched string: 2025-12-04T09:41:44.3972586Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3972631Z 2025-12-04T09:41:44.3972754Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3972758Z 2025-12-04T09:41:44.3972886Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3973014Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3973019Z 2025-12-04T09:41:44.3973112Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3973207Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3973302Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3973395Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3973406Z 2025-12-04T09:41:44.3973496Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3973591Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3973684Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3973775Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3973779Z 2025-12-04T09:41:44.3973783Z 2025-12-04T09:41:44.3973940Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3973947Z 2025-12-04T09:41:44.3973953Z 2025-12-04T09:41:44.3974074Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3974188Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 
2025-12-04T09:41:44.3974307Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3974393Z idx_m = rm[:, None] 2025-12-04T09:41:44.3974478Z idx_n = rn[None, :] 2025-12-04T09:41:44.3974618Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3974623Z 2025-12-04T09:41:44.3974721Z # inductor generates a suffix 2025-12-04T09:41:44.3974811Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3975025Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3975113Z ''', device_str='cuda') 2025-12-04T09:41:44.3975118Z 2025-12-04T09:41:44.3975121Z 2025-12-04T09:41:44.3975227Z async_compile.wait(globals()) 2025-12-04T09:41:44.3975310Z del async_compile 2025-12-04T09:41:44.3975314Z 2025-12-04T09:41:44.3975398Z class Runner: 2025-12-04T09:41:44.3975503Z def __init__(self, partitions): 2025-12-04T09:41:44.3975606Z self.partitions = partitions 2025-12-04T09:41:44.3975610Z 2025-12-04T09:41:44.3975718Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3975855Z new_callables = [] 2025-12-04T09:41:44.3975975Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3976080Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3976185Z self.partitions = new_callables 2025-12-04T09:41:44.3976189Z 2025-12-04T09:41:44.3976279Z def call(self, args): 2025-12-04T09:41:44.3976415Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3976499Z args.clear() 2025-12-04T09:41:44.3976625Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3976752Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3976859Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3976955Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3977124Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3977341Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3977443Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3977634Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 
65536, stream=stream0) 2025-12-04T09:41:44.3977717Z del arg0_1 2025-12-04T09:41:44.3977884Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3978137Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3978238Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3978460Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3978542Z del arg1_1 2025-12-04T09:41:44.3978623Z del buf0 2025-12-04T09:41:44.3978708Z return (buf1, ) 2025-12-04T09:41:44.3978754Z 2025-12-04T09:41:44.3978855Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3978944Z call = runner.call 2025-12-04T09:41:44.3979098Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3979102Z 2025-12-04T09:41:44.3979106Z 2025-12-04T09:41:44.3979247Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3979380Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3979529Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3979728Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3979929Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3980028Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3980195Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3980199Z 2025-12-04T09:41:44.3980203Z 2025-12-04T09:41:44.3980293Z if __name__ == "__main__": 2025-12-04T09:41:44.3980506Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3980663Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3980747Z From CHECK: .to( 2025-12-04T09:41:44.3980751Z 2025-12-04T09:41:44.3980757Z 2025-12-04T09:41:44.3980934Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3981518Z 
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3981523Z 2025-12-04T09:41:44.3981744Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3982270Z FAILED [6.2614s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3982354Z Searched string: 2025-12-04T09:41:44.3982488Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3982495Z 2025-12-04T09:41:44.3982610Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3982614Z 2025-12-04T09:41:44.3982740Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3982966Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3982971Z 2025-12-04T09:41:44.3983067Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3983163Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3983257Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3983351Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3983393Z 2025-12-04T09:41:44.3983490Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3983580Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3983670Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3983765Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3983769Z 2025-12-04T09:41:44.3983773Z 2025-12-04T09:41:44.3983931Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3983938Z 2025-12-04T09:41:44.3983941Z 2025-12-04T09:41:44.3984067Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3984181Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3984300Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3984386Z idx_m = rm[:, None] 2025-12-04T09:41:44.3984473Z idx_n = rn[None, :] 2025-12-04T09:41:44.3984572Z mask = (idx_m < M) & (idx_n < N) 
2025-12-04T09:41:44.3984576Z 2025-12-04T09:41:44.3984676Z # inductor generates a suffix 2025-12-04T09:41:44.3984767Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3984981Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.3985071Z ''', device_str='cuda') 2025-12-04T09:41:44.3985076Z 2025-12-04T09:41:44.3985079Z 2025-12-04T09:41:44.3985183Z async_compile.wait(globals()) 2025-12-04T09:41:44.3985264Z del async_compile 2025-12-04T09:41:44.3985269Z 2025-12-04T09:41:44.3985392Z class Runner: 2025-12-04T09:41:44.3985496Z def __init__(self, partitions): 2025-12-04T09:41:44.3985599Z self.partitions = partitions 2025-12-04T09:41:44.3985604Z 2025-12-04T09:41:44.3985713Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3985808Z new_callables = [] 2025-12-04T09:41:44.3985924Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3986034Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3986140Z self.partitions = new_callables 2025-12-04T09:41:44.3986144Z 2025-12-04T09:41:44.3986233Z def call(self, args): 2025-12-04T09:41:44.3986328Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3986411Z args.clear() 2025-12-04T09:41:44.3986537Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3986670Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3986776Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3986874Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3987046Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3987264Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3987365Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3987552Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3987728Z del arg0_1 2025-12-04T09:41:44.3987918Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3988168Z # Topologically Sorted Source Nodes: [to, mm], 
Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3988268Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3988487Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3988571Z del arg1_1 2025-12-04T09:41:44.3988652Z del buf0 2025-12-04T09:41:44.3988738Z return (buf1, ) 2025-12-04T09:41:44.3988745Z 2025-12-04T09:41:44.3988845Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3988928Z call = runner.call 2025-12-04T09:41:44.3989084Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3989088Z 2025-12-04T09:41:44.3989092Z 2025-12-04T09:41:44.3989274Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3989409Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3989562Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3989761Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3990007Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.3990107Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.3990273Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.3990278Z 2025-12-04T09:41:44.3990282Z 2025-12-04T09:41:44.3990374Z if __name__ == "__main__": 2025-12-04T09:41:44.3990580Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.3990740Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.3990822Z From CHECK: .to( 2025-12-04T09:41:44.3990826Z 2025-12-04T09:41:44.3990832Z 2025-12-04T09:41:44.3991008Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.3991548Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.3991555Z 2025-12-04T09:41:44.3991769Z This message can be suppressed by 
setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.3992311Z FAILED [6.0992s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.3992397Z Searched string: 2025-12-04T09:41:44.3992534Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.3992579Z 2025-12-04T09:41:44.3992696Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.3992700Z 2025-12-04T09:41:44.3992829Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3992961Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.3992965Z 2025-12-04T09:41:44.3993065Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.3993158Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.3993256Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3993347Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.3993354Z 2025-12-04T09:41:44.3993447Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.3993539Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.3993630Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3993723Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.3993727Z 2025-12-04T09:41:44.3993731Z 2025-12-04T09:41:44.3993887Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.3993894Z 2025-12-04T09:41:44.3993898Z 2025-12-04T09:41:44.3994020Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.3994136Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.3994252Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.3994342Z idx_m = rm[:, None] 2025-12-04T09:41:44.3994427Z idx_n = rn[None, :] 2025-12-04T09:41:44.3994561Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.3994566Z 2025-12-04T09:41:44.3994671Z # inductor generates a suffix 2025-12-04T09:41:44.3994765Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.3994979Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 
2025-12-04T09:41:44.3995067Z ''', device_str='cuda') 2025-12-04T09:41:44.3995072Z 2025-12-04T09:41:44.3995075Z 2025-12-04T09:41:44.3995173Z async_compile.wait(globals()) 2025-12-04T09:41:44.3995257Z del async_compile 2025-12-04T09:41:44.3995264Z 2025-12-04T09:41:44.3995342Z class Runner: 2025-12-04T09:41:44.3995442Z def __init__(self, partitions): 2025-12-04T09:41:44.3995548Z self.partitions = partitions 2025-12-04T09:41:44.3995552Z 2025-12-04T09:41:44.3995662Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.3995795Z new_callables = [] 2025-12-04T09:41:44.3995914Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.3996022Z new_callables.append(fn(c)) 2025-12-04T09:41:44.3996133Z self.partitions = new_callables 2025-12-04T09:41:44.3996137Z 2025-12-04T09:41:44.3996227Z def call(self, args): 2025-12-04T09:41:44.3996361Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.3996448Z args.clear() 2025-12-04T09:41:44.3996574Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3996701Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.3996809Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.3996907Z torch.cuda.set_device(0) 2025-12-04T09:41:44.3997079Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3997295Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.3997399Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3997592Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.3997686Z del arg0_1 2025-12-04T09:41:44.3997875Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.3998146Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.3998246Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.3998465Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.3998547Z del arg1_1 
2025-12-04T09:41:44.3998627Z del buf0 2025-12-04T09:41:44.3998761Z return (buf1, ) 2025-12-04T09:41:44.3998765Z 2025-12-04T09:41:44.3998867Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.3998951Z call = runner.call 2025-12-04T09:41:44.3999109Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.3999114Z 2025-12-04T09:41:44.3999120Z 2025-12-04T09:41:44.3999259Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.3999394Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.3999595Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.3999792Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.3999998Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4000097Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4000416Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4000427Z 2025-12-04T09:41:44.4000431Z 2025-12-04T09:41:44.4000527Z if __name__ == "__main__": 2025-12-04T09:41:44.4000731Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4000892Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4000976Z From CHECK: .to( 2025-12-04T09:41:44.4000980Z 2025-12-04T09:41:44.4000986Z 2025-12-04T09:41:44.4001162Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4001786Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4001794Z 2025-12-04T09:41:44.4002013Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4002543Z FAILED [6.1381s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 
2025-12-04T09:41:44.4002628Z Searched string: 2025-12-04T09:41:44.4002762Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4002773Z 2025-12-04T09:41:44.4002889Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4002893Z 2025-12-04T09:41:44.4003018Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4003203Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4003208Z 2025-12-04T09:41:44.4003302Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4003395Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4003494Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4003587Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4003646Z 2025-12-04T09:41:44.4003742Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4003834Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4003927Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4004022Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4004026Z 2025-12-04T09:41:44.4004030Z 2025-12-04T09:41:44.4004187Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4004194Z 2025-12-04T09:41:44.4004198Z 2025-12-04T09:41:44.4004322Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4004438Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4004554Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4004644Z idx_m = rm[:, None] 2025-12-04T09:41:44.4004729Z idx_n = rn[None, :] 2025-12-04T09:41:44.4004826Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4004830Z 2025-12-04T09:41:44.4004932Z # inductor generates a suffix 2025-12-04T09:41:44.4005024Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4005234Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4005328Z ''', device_str='cuda') 2025-12-04T09:41:44.4005332Z 2025-12-04T09:41:44.4005336Z 2025-12-04T09:41:44.4005436Z async_compile.wait(globals()) 2025-12-04T09:41:44.4005523Z del async_compile 2025-12-04T09:41:44.4005585Z 
2025-12-04T09:41:44.4005667Z class Runner: 2025-12-04T09:41:44.4005773Z def __init__(self, partitions): 2025-12-04T09:41:44.4005879Z self.partitions = partitions 2025-12-04T09:41:44.4005884Z 2025-12-04T09:41:44.4005996Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4006090Z new_callables = [] 2025-12-04T09:41:44.4006213Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4006320Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4006428Z self.partitions = new_callables 2025-12-04T09:41:44.4006433Z 2025-12-04T09:41:44.4006522Z def call(self, args): 2025-12-04T09:41:44.4006616Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4006702Z args.clear() 2025-12-04T09:41:44.4006828Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4006952Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4007061Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4007160Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4007326Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4007547Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4007648Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4007841Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4007968Z del arg0_1 2025-12-04T09:41:44.4008132Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4008385Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4008487Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4008706Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4008791Z del arg1_1 2025-12-04T09:41:44.4008870Z del buf0 2025-12-04T09:41:44.4008960Z return (buf1, ) 2025-12-04T09:41:44.4008964Z 2025-12-04T09:41:44.4009065Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4009151Z call = runner.call 2025-12-04T09:41:44.4009311Z 
recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4009316Z 2025-12-04T09:41:44.4009359Z 2025-12-04T09:41:44.4009503Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4009639Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4009787Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4009986Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4010234Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4010333Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4010496Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4010501Z 2025-12-04T09:41:44.4010505Z 2025-12-04T09:41:44.4010600Z if __name__ == "__main__": 2025-12-04T09:41:44.4010800Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4010960Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4011042Z From CHECK: .to( 2025-12-04T09:41:44.4011049Z 2025-12-04T09:41:44.4011053Z 2025-12-04T09:41:44.4011227Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4011785Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4011792Z 2025-12-04T09:41:44.4012007Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4012540Z FAILED [6.0758s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4012627Z Searched string: 2025-12-04T09:41:44.4012829Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4012833Z 2025-12-04T09:41:44.4012952Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4012956Z 
2025-12-04T09:41:44.4013083Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4013216Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4013221Z 2025-12-04T09:41:44.4013316Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4013407Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4013505Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4013597Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4013604Z 2025-12-04T09:41:44.4013693Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4013786Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4013878Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4013968Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4013974Z 2025-12-04T09:41:44.4013978Z 2025-12-04T09:41:44.4014134Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4014141Z 2025-12-04T09:41:44.4014145Z 2025-12-04T09:41:44.4014265Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4014385Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4014501Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4014587Z idx_m = rm[:, None] 2025-12-04T09:41:44.4014719Z idx_n = rn[None, :] 2025-12-04T09:41:44.4014814Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4014818Z 2025-12-04T09:41:44.4014920Z # inductor generates a suffix 2025-12-04T09:41:44.4015014Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4015227Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4015317Z ''', device_str='cuda') 2025-12-04T09:41:44.4015321Z 2025-12-04T09:41:44.4015325Z 2025-12-04T09:41:44.4015423Z async_compile.wait(globals()) 2025-12-04T09:41:44.4015505Z del async_compile 2025-12-04T09:41:44.4015512Z 2025-12-04T09:41:44.4015594Z class Runner: 2025-12-04T09:41:44.4015695Z def __init__(self, partitions): 2025-12-04T09:41:44.4015801Z self.partitions = partitions 2025-12-04T09:41:44.4015805Z 2025-12-04T09:41:44.4015915Z def recursively_apply_fns(self, fns): 
2025-12-04T09:41:44.4016047Z new_callables = [] 2025-12-04T09:41:44.4016170Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4016276Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4016380Z self.partitions = new_callables 2025-12-04T09:41:44.4016385Z 2025-12-04T09:41:44.4016477Z def call(self, args): 2025-12-04T09:41:44.4016609Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4016699Z args.clear() 2025-12-04T09:41:44.4016828Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4016954Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4017065Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4017166Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4017330Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4017549Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4017664Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4017882Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4017973Z del arg0_1 2025-12-04T09:41:44.4018136Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4018392Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4018492Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4018709Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4018795Z del arg1_1 2025-12-04T09:41:44.4018876Z del buf0 2025-12-04T09:41:44.4019002Z return (buf1, ) 2025-12-04T09:41:44.4019006Z 2025-12-04T09:41:44.4019111Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4019195Z call = runner.call 2025-12-04T09:41:44.4019354Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4019358Z 2025-12-04T09:41:44.4019364Z 2025-12-04T09:41:44.4019502Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4019635Z from torch._dynamo.testing import 
rand_strided 2025-12-04T09:41:44.4019786Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4019984Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4020185Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4020288Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4020452Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4020457Z 2025-12-04T09:41:44.4020461Z 2025-12-04T09:41:44.4020555Z if __name__ == "__main__": 2025-12-04T09:41:44.4020756Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4020913Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4021000Z From CHECK: .to( 2025-12-04T09:41:44.4021006Z 2025-12-04T09:41:44.4021010Z 2025-12-04T09:41:44.4021185Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4021767Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4021774Z 2025-12-04T09:41:44.4021991Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4022526Z FAILED [5.9318s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4022612Z Searched string: 2025-12-04T09:41:44.4022748Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4022752Z 2025-12-04T09:41:44.4022872Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4022876Z 2025-12-04T09:41:44.4023003Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4023168Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4023172Z 2025-12-04T09:41:44.4023274Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:44.4023365Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4023459Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4023555Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4023598Z 2025-12-04T09:41:44.4023691Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4023785Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4023874Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4023963Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4023967Z 2025-12-04T09:41:44.4023971Z 2025-12-04T09:41:44.4024133Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4024138Z 2025-12-04T09:41:44.4024142Z 2025-12-04T09:41:44.4024259Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4024377Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4024491Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4024580Z idx_m = rm[:, None] 2025-12-04T09:41:44.4024671Z idx_n = rn[None, :] 2025-12-04T09:41:44.4024766Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4024770Z 2025-12-04T09:41:44.4024867Z # inductor generates a suffix 2025-12-04T09:41:44.4024963Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4025169Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4025260Z ''', device_str='cuda') 2025-12-04T09:41:44.4025264Z 2025-12-04T09:41:44.4025268Z 2025-12-04T09:41:44.4025366Z async_compile.wait(globals()) 2025-12-04T09:41:44.4025451Z del async_compile 2025-12-04T09:41:44.4025498Z 2025-12-04T09:41:44.4025584Z class Runner: 2025-12-04T09:41:44.4025689Z def __init__(self, partitions): 2025-12-04T09:41:44.4025792Z self.partitions = partitions 2025-12-04T09:41:44.4025796Z 2025-12-04T09:41:44.4025911Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4026002Z new_callables = [] 2025-12-04T09:41:44.4026123Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4026230Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4026336Z self.partitions = new_callables 
2025-12-04T09:41:44.4026340Z 2025-12-04T09:41:44.4026438Z def call(self, args): 2025-12-04T09:41:44.4026528Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4026614Z args.clear() 2025-12-04T09:41:44.4026743Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4026867Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4026973Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4027075Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4027239Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4027459Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4027562Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4027751Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4027903Z del arg0_1 2025-12-04T09:41:44.4028090Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4028347Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4028450Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4028666Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.4028752Z del arg1_1 2025-12-04T09:41:44.4028831Z del buf0 2025-12-04T09:41:44.4028918Z return (buf1, ) 2025-12-04T09:41:44.4028922Z 2025-12-04T09:41:44.4029024Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4029111Z call = runner.call 2025-12-04T09:41:44.4029266Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4029310Z 2025-12-04T09:41:44.4029314Z 2025-12-04T09:41:44.4029462Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4029595Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4029745Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4029943Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4030179Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4030285Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4030447Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4030451Z 2025-12-04T09:41:44.4030457Z 2025-12-04T09:41:44.4030547Z if __name__ == "__main__": 2025-12-04T09:41:44.4030748Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4030907Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4030993Z From CHECK: .to( 2025-12-04T09:41:44.4030999Z 2025-12-04T09:41:44.4031003Z 2025-12-04T09:41:44.4031178Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4031727Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4031738Z 2025-12-04T09:41:44.4031950Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4032483Z FAILED [7.8411s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4032572Z Searched string: 2025-12-04T09:41:44.4032746Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4032750Z 2025-12-04T09:41:44.4032866Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4032871Z 2025-12-04T09:41:44.4033003Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4033130Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4033134Z 2025-12-04T09:41:44.4033233Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4033324Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4033417Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4033520Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4033524Z 2025-12-04T09:41:44.4033613Z idx_m = b_k_idx_vals 
2025-12-04T09:41:44.4033706Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4033800Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4033891Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4033895Z 2025-12-04T09:41:44.4033899Z 2025-12-04T09:41:44.4034061Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4034065Z 2025-12-04T09:41:44.4034069Z 2025-12-04T09:41:44.4034188Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4034307Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4034426Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4034514Z idx_m = rm[:, None] 2025-12-04T09:41:44.4034645Z idx_n = rn[None, :] 2025-12-04T09:41:44.4034742Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4034746Z 2025-12-04T09:41:44.4034844Z # inductor generates a suffix 2025-12-04T09:41:44.4034941Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4035150Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4035236Z ''', device_str='cuda') 2025-12-04T09:41:44.4035240Z 2025-12-04T09:41:44.4035244Z 2025-12-04T09:41:44.4035347Z async_compile.wait(globals()) 2025-12-04T09:41:44.4035427Z del async_compile 2025-12-04T09:41:44.4035434Z 2025-12-04T09:41:44.4035518Z class Runner: 2025-12-04T09:41:44.4035619Z def __init__(self, partitions): 2025-12-04T09:41:44.4035724Z self.partitions = partitions 2025-12-04T09:41:44.4035728Z 2025-12-04T09:41:44.4035884Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4035975Z new_callables = [] 2025-12-04T09:41:44.4036093Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4036202Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4036307Z self.partitions = new_callables 2025-12-04T09:41:44.4036311Z 2025-12-04T09:41:44.4036474Z def call(self, args): 2025-12-04T09:41:44.4036569Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4041018Z args.clear() 2025-12-04T09:41:44.4041173Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 
2025-12-04T09:41:44.4041303Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4041412Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4041520Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4041686Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4041904Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4042009Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4042202Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4042285Z del arg0_1 2025-12-04T09:41:44.4042452Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4042705Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4042806Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4043024Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4043109Z del arg1_1 2025-12-04T09:41:44.4043189Z del buf0 2025-12-04T09:41:44.4043345Z return (buf1, ) 2025-12-04T09:41:44.4043350Z 2025-12-04T09:41:44.4043454Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4043541Z call = runner.call 2025-12-04T09:41:44.4043698Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4043705Z 2025-12-04T09:41:44.4043709Z 2025-12-04T09:41:44.4043851Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4043986Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4044133Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4044337Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4044539Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4044638Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4044801Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4044806Z 
2025-12-04T09:41:44.4044816Z 2025-12-04T09:41:44.4044905Z if __name__ == "__main__": 2025-12-04T09:41:44.4045112Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4045269Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4045354Z From CHECK: .to( 2025-12-04T09:41:44.4045358Z 2025-12-04T09:41:44.4045362Z 2025-12-04T09:41:44.4045585Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4046139Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4046146Z 2025-12-04T09:41:44.4046364Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4046905Z FAILED [6.7429s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4046990Z Searched string: 2025-12-04T09:41:44.4047130Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4047134Z 2025-12-04T09:41:44.4047250Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4047254Z 2025-12-04T09:41:44.4047424Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4047558Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4047563Z 2025-12-04T09:41:44.4047677Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4047782Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4047889Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4048023Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4048030Z 2025-12-04T09:41:44.4048119Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4048208Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4048300Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4048390Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4048395Z 2025-12-04T09:41:44.4048398Z 2025-12-04T09:41:44.4048561Z acc += tl.dot(a, b, 
allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4048566Z 2025-12-04T09:41:44.4048573Z 2025-12-04T09:41:44.4048692Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4048808Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4048922Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4049009Z idx_m = rm[:, None] 2025-12-04T09:41:44.4049096Z idx_n = rn[None, :] 2025-12-04T09:41:44.4049191Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4049196Z 2025-12-04T09:41:44.4049294Z # inductor generates a suffix 2025-12-04T09:41:44.4049387Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4049599Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4049685Z ''', device_str='cuda') 2025-12-04T09:41:44.4049689Z 2025-12-04T09:41:44.4049693Z 2025-12-04T09:41:44.4049795Z async_compile.wait(globals()) 2025-12-04T09:41:44.4049919Z del async_compile 2025-12-04T09:41:44.4049924Z 2025-12-04T09:41:44.4050003Z class Runner: 2025-12-04T09:41:44.4050107Z def __init__(self, partitions): 2025-12-04T09:41:44.4050208Z self.partitions = partitions 2025-12-04T09:41:44.4050213Z 2025-12-04T09:41:44.4050325Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4050417Z new_callables = [] 2025-12-04T09:41:44.4050533Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4050643Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4050747Z self.partitions = new_callables 2025-12-04T09:41:44.4050751Z 2025-12-04T09:41:44.4050840Z def call(self, args): 2025-12-04T09:41:44.4050931Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4051011Z args.clear() 2025-12-04T09:41:44.4051137Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4051264Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4051369Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4051467Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4051634Z buf0 = empty_strided_cuda((256, 256), (256, 1), 
torch.float32) 2025-12-04T09:41:44.4051850Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4051950Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4052178Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4052265Z del arg0_1 2025-12-04T09:41:44.4052429Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4052686Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4052783Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4053004Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4053085Z del arg1_1 2025-12-04T09:41:44.4053170Z del buf0 2025-12-04T09:41:44.4053255Z return (buf1, ) 2025-12-04T09:41:44.4053259Z 2025-12-04T09:41:44.4053360Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4053447Z call = runner.call 2025-12-04T09:41:44.4053602Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4053647Z 2025-12-04T09:41:44.4053652Z 2025-12-04T09:41:44.4053795Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4053931Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4054077Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4054320Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4054520Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4054620Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4054789Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4054794Z 2025-12-04T09:41:44.4054800Z 2025-12-04T09:41:44.4054890Z if __name__ == "__main__": 2025-12-04T09:41:44.4055093Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4055252Z compiled_module_main('None', benchmark_compiled_module) 
2025-12-04T09:41:44.4055341Z From CHECK: .to( 2025-12-04T09:41:44.4055346Z 2025-12-04T09:41:44.4055349Z 2025-12-04T09:41:44.4055529Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4056071Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4056078Z 2025-12-04T09:41:44.4056293Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4056821Z FAILED [6.0577s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4056948Z Searched string: 2025-12-04T09:41:44.4057084Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4057089Z 2025-12-04T09:41:44.4057205Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4057210Z 2025-12-04T09:41:44.4057338Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4057466Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4057470Z 2025-12-04T09:41:44.4057575Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4057685Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4057793Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4057897Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4057902Z 2025-12-04T09:41:44.4057991Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4058081Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4058171Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4058263Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4058267Z 2025-12-04T09:41:44.4058273Z 2025-12-04T09:41:44.4058431Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4058435Z 2025-12-04T09:41:44.4058439Z 2025-12-04T09:41:44.4058559Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4058675Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 
2025-12-04T09:41:44.4058788Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4058879Z idx_m = rm[:, None] 2025-12-04T09:41:44.4059007Z idx_n = rn[None, :] 2025-12-04T09:41:44.4059107Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4059111Z 2025-12-04T09:41:44.4059208Z # inductor generates a suffix 2025-12-04T09:41:44.4059300Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4059515Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4059602Z ''', device_str='cuda') 2025-12-04T09:41:44.4059606Z 2025-12-04T09:41:44.4059610Z 2025-12-04T09:41:44.4059708Z async_compile.wait(globals()) 2025-12-04T09:41:44.4059793Z del async_compile 2025-12-04T09:41:44.4059798Z 2025-12-04T09:41:44.4059874Z class Runner: 2025-12-04T09:41:44.4059977Z def __init__(self, partitions): 2025-12-04T09:41:44.4060079Z self.partitions = partitions 2025-12-04T09:41:44.4060084Z 2025-12-04T09:41:44.4060233Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4060325Z new_callables = [] 2025-12-04T09:41:44.4060444Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4060550Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4060657Z self.partitions = new_callables 2025-12-04T09:41:44.4060700Z 2025-12-04T09:41:44.4060792Z def call(self, args): 2025-12-04T09:41:44.4060881Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4060961Z args.clear() 2025-12-04T09:41:44.4061090Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4061216Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4061322Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4061420Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4061586Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4061804Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4061902Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4062095Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 
65536, stream=stream0) 2025-12-04T09:41:44.4062176Z del arg0_1 2025-12-04T09:41:44.4062340Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4062591Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4062690Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4062909Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4062991Z del arg1_1 2025-12-04T09:41:44.4063118Z del buf0 2025-12-04T09:41:44.4063205Z return (buf1, ) 2025-12-04T09:41:44.4063210Z 2025-12-04T09:41:44.4063309Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4063398Z call = runner.call 2025-12-04T09:41:44.4063555Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4063559Z 2025-12-04T09:41:44.4063563Z 2025-12-04T09:41:44.4063700Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4063835Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4063983Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4064181Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4064381Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4064480Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4064645Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4064652Z 2025-12-04T09:41:44.4064656Z 2025-12-04T09:41:44.4064745Z if __name__ == "__main__": 2025-12-04T09:41:44.4064945Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4065105Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4065191Z From CHECK: .to( 2025-12-04T09:41:44.4065195Z 2025-12-04T09:41:44.4065199Z 2025-12-04T09:41:44.4065425Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4065971Z 
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4065979Z 2025-12-04T09:41:44.4066196Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4066723Z FAILED [7.1978s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4066810Z Searched string: 2025-12-04T09:41:44.4066944Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4066948Z 2025-12-04T09:41:44.4067065Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4067069Z 2025-12-04T09:41:44.4067260Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4067388Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4067393Z 2025-12-04T09:41:44.4067488Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4067578Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4067674Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4067811Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4067816Z 2025-12-04T09:41:44.4067906Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4067997Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4068088Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4068180Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4068184Z 2025-12-04T09:41:44.4068190Z 2025-12-04T09:41:44.4068346Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4068351Z 2025-12-04T09:41:44.4068355Z 2025-12-04T09:41:44.4068476Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4068593Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4068705Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4068796Z idx_m = rm[:, None] 2025-12-04T09:41:44.4068882Z idx_n = rn[None, :] 2025-12-04T09:41:44.4068974Z mask = (idx_m < M) & (idx_n < N) 
2025-12-04T09:41:44.4068979Z 2025-12-04T09:41:44.4069080Z # inductor generates a suffix 2025-12-04T09:41:44.4069169Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4069380Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4069466Z ''', device_str='cuda') 2025-12-04T09:41:44.4069471Z 2025-12-04T09:41:44.4069475Z 2025-12-04T09:41:44.4069571Z async_compile.wait(globals()) 2025-12-04T09:41:44.4069699Z del async_compile 2025-12-04T09:41:44.4069704Z 2025-12-04T09:41:44.4069780Z class Runner: 2025-12-04T09:41:44.4069880Z def __init__(self, partitions): 2025-12-04T09:41:44.4069985Z self.partitions = partitions 2025-12-04T09:41:44.4069989Z 2025-12-04T09:41:44.4070101Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4070195Z new_callables = [] 2025-12-04T09:41:44.4070313Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4070417Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4070527Z self.partitions = new_callables 2025-12-04T09:41:44.4070533Z 2025-12-04T09:41:44.4070621Z def call(self, args): 2025-12-04T09:41:44.4070709Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4070792Z args.clear() 2025-12-04T09:41:44.4070917Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4071041Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4071150Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4071248Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4071414Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4071632Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4071730Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4071962Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4072047Z del arg0_1 2025-12-04T09:41:44.4072207Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4072462Z # Topologically Sorted Source Nodes: [to, mm], 
Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4072562Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4072781Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4072862Z del arg1_1 2025-12-04T09:41:44.4072946Z del buf0 2025-12-04T09:41:44.4073031Z return (buf1, ) 2025-12-04T09:41:44.4073036Z 2025-12-04T09:41:44.4073135Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4073219Z call = runner.call 2025-12-04T09:41:44.4073418Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4073423Z 2025-12-04T09:41:44.4073427Z 2025-12-04T09:41:44.4073570Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4073706Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4073854Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4074097Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4074299Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4074398Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4074559Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4074566Z 2025-12-04T09:41:44.4074573Z 2025-12-04T09:41:44.4074660Z if __name__ == "__main__": 2025-12-04T09:41:44.4074859Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4075020Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4075105Z From CHECK: .to( 2025-12-04T09:41:44.4075109Z 2025-12-04T09:41:44.4075113Z 2025-12-04T09:41:44.4075288Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4075828Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4075835Z 2025-12-04T09:41:44.4076051Z This message can be suppressed by 
setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4076584Z FAILED [6.2279s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4076710Z Searched string: 2025-12-04T09:41:44.4076841Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4076845Z 2025-12-04T09:41:44.4076965Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4076969Z 2025-12-04T09:41:44.4077096Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4077225Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4077232Z 2025-12-04T09:41:44.4077327Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4077417Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4077513Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4077609Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4077613Z 2025-12-04T09:41:44.4077700Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4077811Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4077909Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4078025Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4078030Z 2025-12-04T09:41:44.4078038Z 2025-12-04T09:41:44.4078194Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4078199Z 2025-12-04T09:41:44.4078202Z 2025-12-04T09:41:44.4078321Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4078438Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4078550Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4078684Z idx_m = rm[:, None] 2025-12-04T09:41:44.4078772Z idx_n = rn[None, :] 2025-12-04T09:41:44.4078865Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4078869Z 2025-12-04T09:41:44.4078972Z # inductor generates a suffix 2025-12-04T09:41:44.4079060Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4079267Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 
2025-12-04T09:41:44.4079356Z ''', device_str='cuda') 2025-12-04T09:41:44.4079360Z 2025-12-04T09:41:44.4079364Z 2025-12-04T09:41:44.4079462Z async_compile.wait(globals()) 2025-12-04T09:41:44.4079609Z del async_compile 2025-12-04T09:41:44.4079613Z 2025-12-04T09:41:44.4079696Z class Runner: 2025-12-04T09:41:44.4079797Z def __init__(self, partitions): 2025-12-04T09:41:44.4079900Z self.partitions = partitions 2025-12-04T09:41:44.4079904Z 2025-12-04T09:41:44.4080057Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4080149Z new_callables = [] 2025-12-04T09:41:44.4080269Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4080373Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4080476Z self.partitions = new_callables 2025-12-04T09:41:44.4080522Z 2025-12-04T09:41:44.4080616Z def call(self, args): 2025-12-04T09:41:44.4080703Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4080787Z args.clear() 2025-12-04T09:41:44.4080913Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4081043Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4081152Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4081252Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4081414Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4081637Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4081733Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4081925Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4082007Z del arg0_1 2025-12-04T09:41:44.4082166Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4082421Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4082519Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4082735Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4082820Z del arg1_1 
2025-12-04T09:41:44.4082947Z del buf0 2025-12-04T09:41:44.4083034Z return (buf1, ) 2025-12-04T09:41:44.4083040Z 2025-12-04T09:41:44.4083140Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4083225Z call = runner.call 2025-12-04T09:41:44.4083386Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4083390Z 2025-12-04T09:41:44.4083394Z 2025-12-04T09:41:44.4083533Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4083665Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4083813Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4084012Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4084216Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4084315Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4084478Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4084485Z 2025-12-04T09:41:44.4084489Z 2025-12-04T09:41:44.4084578Z if __name__ == "__main__": 2025-12-04T09:41:44.4084777Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4084939Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4085025Z From CHECK: .to( 2025-12-04T09:41:44.4085030Z 2025-12-04T09:41:44.4085033Z 2025-12-04T09:41:44.4085248Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4085796Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4085803Z 2025-12-04T09:41:44.4086018Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4086547Z FAILED [7.0860s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 
2025-12-04T09:41:44.4086638Z Searched string: 2025-12-04T09:41:44.4086769Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4086773Z 2025-12-04T09:41:44.4086893Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4086897Z 2025-12-04T09:41:44.4087063Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4087190Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4087198Z 2025-12-04T09:41:44.4087295Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4087384Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4087481Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4087616Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4087621Z 2025-12-04T09:41:44.4087710Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4087803Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4087895Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4088000Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4088009Z 2025-12-04T09:41:44.4088014Z 2025-12-04T09:41:44.4088200Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4088205Z 2025-12-04T09:41:44.4088210Z 2025-12-04T09:41:44.4088331Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4088453Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4088565Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4088651Z idx_m = rm[:, None] 2025-12-04T09:41:44.4088737Z idx_n = rn[None, :] 2025-12-04T09:41:44.4088831Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4088835Z 2025-12-04T09:41:44.4088934Z # inductor generates a suffix 2025-12-04T09:41:44.4089025Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4089233Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4089327Z ''', device_str='cuda') 2025-12-04T09:41:44.4089331Z 2025-12-04T09:41:44.4089334Z 2025-12-04T09:41:44.4089432Z async_compile.wait(globals()) 2025-12-04T09:41:44.4089557Z del async_compile 2025-12-04T09:41:44.4089561Z 
2025-12-04T09:41:44.4089645Z class Runner: 2025-12-04T09:41:44.4089745Z def __init__(self, partitions): 2025-12-04T09:41:44.4089848Z self.partitions = partitions 2025-12-04T09:41:44.4089855Z 2025-12-04T09:41:44.4089968Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4090056Z new_callables = [] 2025-12-04T09:41:44.4090177Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4090280Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4090382Z self.partitions = new_callables 2025-12-04T09:41:44.4090389Z 2025-12-04T09:41:44.4090481Z def call(self, args): 2025-12-04T09:41:44.4090568Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4090649Z args.clear() 2025-12-04T09:41:44.4090776Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4090901Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4091012Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4091106Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4091269Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4091489Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4091586Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4091893Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4091986Z del arg0_1 2025-12-04T09:41:44.4092146Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4092402Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4092501Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4092716Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4092802Z del arg1_1 2025-12-04T09:41:44.4092883Z del buf0 2025-12-04T09:41:44.4092965Z return (buf1, ) 2025-12-04T09:41:44.4092969Z 2025-12-04T09:41:44.4093074Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4093160Z call = runner.call 2025-12-04T09:41:44.4093361Z 
recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4093366Z 2025-12-04T09:41:44.4093370Z 2025-12-04T09:41:44.4093516Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4093647Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4093801Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4094042Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4094242Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4094343Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4094505Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4094512Z 2025-12-04T09:41:44.4094516Z 2025-12-04T09:41:44.4094607Z if __name__ == "__main__": 2025-12-04T09:41:44.4094806Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4094965Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4095051Z From CHECK: .to( 2025-12-04T09:41:44.4095055Z 2025-12-04T09:41:44.4095059Z 2025-12-04T09:41:44.4095233Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4095778Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4095785Z 2025-12-04T09:41:44.4096000Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4096528Z FAILED [7.5149s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4096659Z Searched string: 2025-12-04T09:41:44.4096792Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4096796Z 2025-12-04T09:41:44.4096914Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4096918Z 
2025-12-04T09:41:44.4097046Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4097172Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4097181Z 2025-12-04T09:41:44.4097277Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4097371Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4097469Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4097563Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4097568Z 2025-12-04T09:41:44.4097655Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4097746Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4097837Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4097925Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4097932Z 2025-12-04T09:41:44.4097935Z 2025-12-04T09:41:44.4098093Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4098098Z 2025-12-04T09:41:44.4098102Z 2025-12-04T09:41:44.4098220Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4098340Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4098453Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4098582Z idx_m = rm[:, None] 2025-12-04T09:41:44.4098671Z idx_n = rn[None, :] 2025-12-04T09:41:44.4098764Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4098771Z 2025-12-04T09:41:44.4098868Z # inductor generates a suffix 2025-12-04T09:41:44.4098961Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4099171Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4099260Z ''', device_str='cuda') 2025-12-04T09:41:44.4099264Z 2025-12-04T09:41:44.4099268Z 2025-12-04T09:41:44.4099366Z async_compile.wait(globals()) 2025-12-04T09:41:44.4099451Z del async_compile 2025-12-04T09:41:44.4099455Z 2025-12-04T09:41:44.4099538Z class Runner: 2025-12-04T09:41:44.4099638Z def __init__(self, partitions): 2025-12-04T09:41:44.4099738Z self.partitions = partitions 2025-12-04T09:41:44.4099782Z 2025-12-04T09:41:44.4099903Z def recursively_apply_fns(self, fns): 
2025-12-04T09:41:44.4099993Z new_callables = [] 2025-12-04T09:41:44.4100113Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4100218Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4100489Z self.partitions = new_callables 2025-12-04T09:41:44.4100570Z 2025-12-04T09:41:44.4100671Z def call(self, args): 2025-12-04T09:41:44.4100760Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4100842Z args.clear() 2025-12-04T09:41:44.4100970Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4101093Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4101200Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4101301Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4101464Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4101687Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4101784Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4101974Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4102058Z del arg0_1 2025-12-04T09:41:44.4102216Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4102468Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4102568Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4102784Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.4102931Z del arg1_1 2025-12-04T09:41:44.4103010Z del buf0 2025-12-04T09:41:44.4103095Z return (buf1, ) 2025-12-04T09:41:44.4103100Z 2025-12-04T09:41:44.4103201Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4103285Z call = runner.call 2025-12-04T09:41:44.4103442Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4103446Z 2025-12-04T09:41:44.4103449Z 2025-12-04T09:41:44.4103592Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4103722Z from torch._dynamo.testing import 
rand_strided 2025-12-04T09:41:44.4103870Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4104068Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4104268Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4104369Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4104530Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4104537Z 2025-12-04T09:41:44.4104541Z 2025-12-04T09:41:44.4104631Z if __name__ == "__main__": 2025-12-04T09:41:44.4104831Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4104993Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4105081Z From CHECK: .to( 2025-12-04T09:41:44.4105085Z 2025-12-04T09:41:44.4105089Z 2025-12-04T09:41:44.4105321Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4105870Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4105881Z 2025-12-04T09:41:44.4106098Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4106625Z FAILED [6.3124s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4106716Z Searched string: 2025-12-04T09:41:44.4106849Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4106853Z 2025-12-04T09:41:44.4106969Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4106973Z 2025-12-04T09:41:44.4107157Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4107289Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4107293Z 2025-12-04T09:41:44.4107393Z idx_m = offs_a_m[:, None] 
2025-12-04T09:41:44.4107481Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4107630Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4107741Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4107745Z 2025-12-04T09:41:44.4107848Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4107946Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4108040Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4108128Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4108134Z 2025-12-04T09:41:44.4108138Z 2025-12-04T09:41:44.4108298Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4108303Z 2025-12-04T09:41:44.4108306Z 2025-12-04T09:41:44.4108424Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4108541Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4108658Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4108747Z idx_m = rm[:, None] 2025-12-04T09:41:44.4108834Z idx_n = rn[None, :] 2025-12-04T09:41:44.4108927Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4108934Z 2025-12-04T09:41:44.4109029Z # inductor generates a suffix 2025-12-04T09:41:44.4109126Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4109334Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4109420Z ''', device_str='cuda') 2025-12-04T09:41:44.4109424Z 2025-12-04T09:41:44.4109428Z 2025-12-04T09:41:44.4109572Z async_compile.wait(globals()) 2025-12-04T09:41:44.4109653Z del async_compile 2025-12-04T09:41:44.4109658Z 2025-12-04T09:41:44.4109742Z class Runner: 2025-12-04T09:41:44.4109841Z def __init__(self, partitions): 2025-12-04T09:41:44.4109944Z self.partitions = partitions 2025-12-04T09:41:44.4109951Z 2025-12-04T09:41:44.4110062Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4110152Z new_callables = [] 2025-12-04T09:41:44.4110271Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4110378Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4110479Z self.partitions = new_callables 
2025-12-04T09:41:44.4110486Z 2025-12-04T09:41:44.4110576Z def call(self, args): 2025-12-04T09:41:44.4110663Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4110745Z args.clear() 2025-12-04T09:41:44.4110875Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4110999Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4111108Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4111207Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4111370Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4111588Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4111689Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4111917Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4112003Z del arg0_1 2025-12-04T09:41:44.4112161Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4112416Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4112516Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4112732Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 16, 1, 1, stream=stream0) 2025-12-04T09:41:44.4112820Z del arg1_1 2025-12-04T09:41:44.4112902Z del buf0 2025-12-04T09:41:44.4112986Z return (buf1, ) 2025-12-04T09:41:44.4112990Z 2025-12-04T09:41:44.4113093Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4113177Z call = runner.call 2025-12-04T09:41:44.4113377Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4113383Z 2025-12-04T09:41:44.4113386Z 2025-12-04T09:41:44.4113535Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4113666Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4113813Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4114056Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4114254Z arg1_1 = 
rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4114357Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4114520Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4114527Z 2025-12-04T09:41:44.4114531Z 2025-12-04T09:41:44.4114619Z if __name__ == "__main__": 2025-12-04T09:41:44.4114821Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4114982Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4115065Z From CHECK: .to( 2025-12-04T09:41:44.4115072Z 2025-12-04T09:41:44.4115076Z 2025-12-04T09:41:44.4115251Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4115790Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4115797Z 2025-12-04T09:41:44.4116014Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4116543Z FAILED [6.1455s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4116699Z Searched string: 2025-12-04T09:41:44.4116832Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4116837Z 2025-12-04T09:41:44.4116954Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4116958Z 2025-12-04T09:41:44.4117089Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4117218Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4117222Z 2025-12-04T09:41:44.4117315Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4117407Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4117502Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4117611Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4117616Z 2025-12-04T09:41:44.4117712Z idx_m = b_k_idx_vals 
2025-12-04T09:41:44.4117823Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4117923Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4118013Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4118019Z 2025-12-04T09:41:44.4118023Z 2025-12-04T09:41:44.4118180Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4118187Z 2025-12-04T09:41:44.4118191Z 2025-12-04T09:41:44.4118313Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4118428Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4118545Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4118674Z idx_m = rm[:, None] 2025-12-04T09:41:44.4118760Z idx_n = rn[None, :] 2025-12-04T09:41:44.4118857Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4118864Z 2025-12-04T09:41:44.4118962Z # inductor generates a suffix 2025-12-04T09:41:44.4119050Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4119262Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4119348Z ''', device_str='cuda') 2025-12-04T09:41:44.4119352Z 2025-12-04T09:41:44.4119356Z 2025-12-04T09:41:44.4119460Z async_compile.wait(globals()) 2025-12-04T09:41:44.4119591Z del async_compile 2025-12-04T09:41:44.4119596Z 2025-12-04T09:41:44.4119675Z class Runner: 2025-12-04T09:41:44.4119779Z def __init__(self, partitions): 2025-12-04T09:41:44.4119881Z self.partitions = partitions 2025-12-04T09:41:44.4119929Z 2025-12-04T09:41:44.4120046Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4120135Z new_callables = [] 2025-12-04T09:41:44.4120251Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4120357Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4120461Z self.partitions = new_callables 2025-12-04T09:41:44.4120503Z 2025-12-04T09:41:44.4120598Z def call(self, args): 2025-12-04T09:41:44.4120689Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4120770Z args.clear() 2025-12-04T09:41:44.4120895Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 
2025-12-04T09:41:44.4121024Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4121133Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4121234Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4121398Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4121617Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4121716Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4121905Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4121986Z del arg0_1 2025-12-04T09:41:44.4122152Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4122412Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4122514Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4122729Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4122855Z del arg1_1 2025-12-04T09:41:44.4122938Z del buf0 2025-12-04T09:41:44.4123023Z return (buf1, ) 2025-12-04T09:41:44.4123028Z 2025-12-04T09:41:44.4123129Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4123216Z call = runner.call 2025-12-04T09:41:44.4123375Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4123379Z 2025-12-04T09:41:44.4123383Z 2025-12-04T09:41:44.4123525Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4123655Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4123801Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4124005Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4124206Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4124309Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4124474Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4124481Z 
2025-12-04T09:41:44.4124484Z 2025-12-04T09:41:44.4124572Z if __name__ == "__main__": 2025-12-04T09:41:44.4124774Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4124935Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:41:44.4125019Z From CHECK: .to( 2025-12-04T09:41:44.4125023Z 2025-12-04T09:41:44.4125027Z 2025-12-04T09:41:44.4125246Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4125797Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4125805Z 2025-12-04T09:41:44.4126022Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4126551Z FAILED [7.9201s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:41:44.4126640Z Searched string: 2025-12-04T09:41:44.4126774Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:41:44.4126778Z 2025-12-04T09:41:44.4126894Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:41:44.4126938Z 2025-12-04T09:41:44.4127073Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4127201Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:41:44.4127206Z 2025-12-04T09:41:44.4127302Z idx_m = offs_a_m[:, None] 2025-12-04T09:41:44.4127394Z idx_n = a_k_idx_vals 2025-12-04T09:41:44.4127530Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4127623Z a = tl.load(A + (xindex)) 2025-12-04T09:41:44.4127627Z 2025-12-04T09:41:44.4127717Z idx_m = b_k_idx_vals 2025-12-04T09:41:44.4127810Z idx_n = offs_b_n[None, :] 2025-12-04T09:41:44.4127903Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4127993Z b = tl.load(B + (xindex)) 2025-12-04T09:41:44.4128000Z 2025-12-04T09:41:44.4128004Z 2025-12-04T09:41:44.4128160Z acc += tl.dot(a, b, 
allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:41:44.4128164Z 2025-12-04T09:41:44.4128168Z 2025-12-04T09:41:44.4128292Z # rematerialize rm and rn to save registers 2025-12-04T09:41:44.4128406Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:41:44.4128524Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:41:44.4128612Z idx_m = rm[:, None] 2025-12-04T09:41:44.4128695Z idx_n = rn[None, :] 2025-12-04T09:41:44.4128791Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:41:44.4128797Z 2025-12-04T09:41:44.4128893Z # inductor generates a suffix 2025-12-04T09:41:44.4128981Z xindex = idx_n + 256*idx_m 2025-12-04T09:41:44.4129191Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:41:44.4129278Z ''', device_str='cuda') 2025-12-04T09:41:44.4129282Z 2025-12-04T09:41:44.4129285Z 2025-12-04T09:41:44.4129428Z async_compile.wait(globals()) 2025-12-04T09:41:44.4129509Z del async_compile 2025-12-04T09:41:44.4129513Z 2025-12-04T09:41:44.4129593Z class Runner: 2025-12-04T09:41:44.4129694Z def __init__(self, partitions): 2025-12-04T09:41:44.4129798Z self.partitions = partitions 2025-12-04T09:41:44.4129802Z 2025-12-04T09:41:44.4129910Z def recursively_apply_fns(self, fns): 2025-12-04T09:41:44.4130003Z new_callables = [] 2025-12-04T09:41:44.4130120Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:41:44.4130223Z new_callables.append(fn(c)) 2025-12-04T09:41:44.4130333Z self.partitions = new_callables 2025-12-04T09:41:44.4130337Z 2025-12-04T09:41:44.4130425Z def call(self, args): 2025-12-04T09:41:44.4130516Z arg0_1, arg1_1 = args 2025-12-04T09:41:44.4130599Z args.clear() 2025-12-04T09:41:44.4130725Z assert_size_stride(arg0_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4130852Z assert_size_stride(arg1_1, (256, 256), (256, 1)) 2025-12-04T09:41:44.4130964Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:41:44.4131058Z torch.cuda.set_device(0) 2025-12-04T09:41:44.4131228Z buf0 = empty_strided_cuda((256, 256), (256, 1), 
torch.float32) 2025-12-04T09:41:44.4131445Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:41:44.4131546Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4131775Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 65536, stream=stream0) 2025-12-04T09:41:44.4131857Z del arg0_1 2025-12-04T09:41:44.4132025Z buf1 = empty_strided_cuda((256, 256), (256, 1), torch.float32) 2025-12-04T09:41:44.4132275Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:41:44.4132374Z stream0 = get_raw_stream(0) 2025-12-04T09:41:44.4132592Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 64, 1, 1, stream=stream0) 2025-12-04T09:41:44.4132676Z del arg1_1 2025-12-04T09:41:44.4132756Z del buf0 2025-12-04T09:41:44.4132842Z return (buf1, ) 2025-12-04T09:41:44.4132846Z 2025-12-04T09:41:44.4132946Z runner = Runner(partitions=[]) 2025-12-04T09:41:44.4133032Z call = runner.call 2025-12-04T09:41:44.4133228Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:41:44.4133233Z 2025-12-04T09:41:44.4133237Z 2025-12-04T09:41:44.4133384Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:41:44.4133517Z from torch._dynamo.testing import rand_strided 2025-12-04T09:41:44.4133663Z from torch._inductor.utils import print_performance 2025-12-04T09:41:44.4133903Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:41:44.4134100Z arg1_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:41:44.4134198Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:41:44.4134364Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:41:44.4134371Z 2025-12-04T09:41:44.4134375Z 2025-12-04T09:41:44.4134465Z if __name__ == "__main__": 2025-12-04T09:41:44.4134670Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:41:44.4134834Z compiled_module_main('None', benchmark_compiled_module) 
2025-12-04T09:41:44.4134919Z From CHECK: .to( 2025-12-04T09:41:44.4134923Z 2025-12-04T09:41:44.4134927Z 2025-12-04T09:41:44.4135104Z To execute this test, run the following from the base repo dir: 2025-12-04T09:41:44.4135641Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_exhaustive_dtypes 2025-12-04T09:41:44.4135648Z 2025-12-04T09:41:44.4135860Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:41:44.4136040Z ============ 30 failed, 70 passed, 150 skipped in 553.89s (0:09:13) ============ 2025-12-04T09:41:44.4136044Z 2025-12-04T09:41:44.4136561Z FINISHED PRINTING LOG FILE of inductor/test_pattern_matcher 1/1 (test/test-reports/inductor.test_pattern_matcher_1.1_71c2676cd32e51e5_.log) 2025-12-04T09:41:44.4136566Z 2025-12-04T09:41:44.4136863Z Finished inductor/test_pattern_matcher 1/1 ... [2025-12-04 09:41:42.825377][1781.605340815], took 9.37min 2025-12-04T09:41:44.4137521Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-c842e470cbb98a3c.xml 2025-12-04T09:41:44.4137662Z Uploading logs for 57118183207 to S3 2025-12-04T09:41:44.4137771Z Uploading artifacts took 1.01 seconds 2025-12-04T09:41:44.4137884Z inductor/test_pattern_matcher 1/1 failed! 2025-12-04T09:41:44.4138116Z Running inductor/test_cuda_repro 1/1 ... [2025-12-04 09:41:44.259384][1783.039352263] 2025-12-04T09:41:44.4138222Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:44.4139164Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_repro.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:41:44.259942] 2025-12-04T09:41:51.4893494Z 2025-12-04T09:41:51.4894755Z inductor/test_cuda_repro 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_repro_1.1_6d15300668add3fc_.log 2025-12-04T09:41:51.4961182Z Running 200 items in this shard: test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_emulate_precision_casts_mean_ratio_chain, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_qwen2_7b_sdpa_input_alignment_requires_recompile, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_repeated_masked_load, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, 
test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent, test/inductor/test_cuda_repro.py::CudaReproTests::test_truediv_base_not_bitwise_equivalent 2025-12-04T09:41:51.5026158Z 2025-12-04T09:41:51.5026437Z Finished inductor/test_cuda_repro 1/1 ... [2025-12-04 09:41:51.489270][1790.269242284], took 0.12min 2025-12-04T09:41:51.5027431Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_cuda_repro/inductor.test_cuda_repro-e99e0f6eb81b07f7.xml 2025-12-04T09:41:51.5823213Z Running dynamo/test_activation_checkpointing 1/1 ... 
[2025-12-04 09:41:51.581895][1790.361867665] 2025-12-04T09:41:51.5823722Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:51.5826804Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_activation_checkpointing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:41:51.582257] 2025-12-04T09:41:58.1085459Z 2025-12-04T09:41:58.1086602Z dynamo/test_activation_checkpointing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_activation_checkpointing_1.1_3779be9d1a103562_.log 2025-12-04T09:41:58.1087475Z Running 0 items in this shard: 2025-12-04T09:41:58.1087665Z 2025-12-04T09:41:58.1087995Z Finished dynamo/test_activation_checkpointing 1/1 ... [2025-12-04 09:41:58.108188][1796.888161415], took 0.11min 2025-12-04T09:41:58.1101115Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_activation_checkpointing/dynamo.test_activation_checkpointing-38da9e9ed6d54bce.xml 2025-12-04T09:41:58.1715836Z Running dynamo/test_logging 1/1 ... [2025-12-04 09:41:58.171184][1796.951157116] 2025-12-04T09:41:58.1716406Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:41:58.1719820Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_logging.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:41:58.171550] 2025-12-04T09:42:04.4475224Z 2025-12-04T09:42:04.4476440Z dynamo/test_logging 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_logging_1.1_30144991ab7ca96e_.log 2025-12-04T09:42:04.4477203Z Running 0 items in this shard: 2025-12-04T09:42:04.4477393Z 2025-12-04T09:42:04.4477678Z Finished dynamo/test_logging 1/1 ... [2025-12-04 09:42:04.447203][1803.227176237], took 0.10min 2025-12-04T09:42:04.4493880Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_logging/dynamo.test_logging-f357a72c7cdbf6f8.xml 2025-12-04T09:42:04.5210211Z Running dynamo/test_repros 1/1 ... [2025-12-04 09:42:04.520672][1803.300645273] 2025-12-04T09:42:04.5210654Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:04.5214442Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_repros.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:42:04.521001] 2025-12-04T09:42:35.4518432Z 2025-12-04T09:42:35.4519428Z dynamo/test_repros 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_repros_1.1_3a07b22e3f77e1e4_.log 2025-12-04T09:42:35.4607200Z Running 350 items in this shard: test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, 
test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, 
test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_create_rand_mask_from_inputs, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, 
test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, 
test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_aggressively_write_assert, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, 
test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_dont_dce_rand, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, 
test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, 
test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_get_parameter_dtype, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, 
test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, 
test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_graph_break_unsupported_fake, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, 
test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, 
test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, 
test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename, test/dynamo/test_repros.py::ReproTests::test_relative_import_no_modulename 2025-12-04T09:42:35.4691754Z 2025-12-04T09:42:35.4692010Z Finished dynamo/test_repros 1/1 ... [2025-12-04 09:42:35.451941][1834.231913615], took 0.52min 2025-12-04T09:42:35.4692926Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_repros/dynamo.test_repros-9531739f45e08308.xml 2025-12-04T09:42:35.5505020Z Running inductor/test_flex_attention 2/6 ... 
[2025-12-04 09:42:35.550136][1834.330106231] 2025-12-04T09:42:35.5505681Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:35.5509360Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=2', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:42:35.550508] 2025-12-04T09:42:43.3289492Z 2025-12-04T09:42:43.3290385Z inductor/test_flex_attention 2/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_2.6_90fdd3b7d65ce4b2_.log 2025-12-04T09:42:43.3291466Z Running 0 items in this shard: 2025-12-04T09:42:43.3291648Z 2025-12-04T09:42:43.3291972Z Finished inductor/test_flex_attention 2/6 ... [2025-12-04 09:42:43.328593][1842.108565852], took 0.13min 2025-12-04T09:42:43.3310099Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-b253e91c6bffe97f.xml 2025-12-04T09:42:43.3925282Z Running inductor/test_flex_decoding 2/3 ... [2025-12-04 09:42:43.392192][1842.172165252] 2025-12-04T09:42:43.3925769Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:43.3929171Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_decoding.py', '--shard-id=2', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:42:43.392536] 2025-12-04T09:42:48.0153149Z 2025-12-04T09:42:48.0154128Z inductor/test_flex_decoding 2/3 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_decoding_2.3_729d33e033147be3_.log 2025-12-04T09:42:48.0154925Z Running 0 items in this shard: 2025-12-04T09:42:48.0155107Z 2025-12-04T09:42:48.0155679Z Finished inductor/test_flex_decoding 2/3 ... [2025-12-04 09:42:48.014997][1846.794970856], took 0.08min 2025-12-04T09:42:48.0175335Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_decoding/inductor.test_flex_decoding-718ee71e5383cf4c.xml 2025-12-04T09:42:48.0382183Z Running dynamo/test_fx_graph_runnable 1/1 ... [2025-12-04 09:42:48.037901][1846.817875937] 2025-12-04T09:42:48.0382749Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:42:48.0385677Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_fx_graph_runnable.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:42:48.038205] 2025-12-04T10:37:41.3580124Z 2025-12-04T10:37:41.3581432Z dynamo/test_fx_graph_runnable 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_fx_graph_runnable_1.1_ae8ca2ee8e2c6bb8_.log 2025-12-04T10:37:41.3703918Z Running 350 items in this shard: test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_all_reduce_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_basic_tensor_add, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_add_dynamic, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_broadcast_collective, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_expression, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_dynamic_shapes_run, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, 
test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing, test/dynamo/test_fx_graph_runnable.py::FxGraphRunnableTest::test_toy_model_batch_processing 2025-12-04T10:37:41.3813025Z 2025-12-04T10:37:41.3813333Z Finished dynamo/test_fx_graph_runnable 1/1 ... [2025-12-04 10:37:41.357312][5140.137284497], took 54.89min 2025-12-04T10:37:41.3814379Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_fx_graph_runnable/dynamo.test_fx_graph_runnable-51563ef4cc34da7e.xml 2025-12-04T10:37:42.1926292Z Uploading artifacts took 0.74 seconds 2025-12-04T10:37:42.1930195Z Running inductor/test_online_softmax 1/1 ... 
[2025-12-04 10:37:42.192660][5140.972631322] 2025-12-04T10:37:42.1931130Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:37:42.1934489Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_online_softmax.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:37:42.193098] 2025-12-04T10:37:48.4698723Z 2025-12-04T10:37:48.4699715Z inductor/test_online_softmax 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_online_softmax_1.1_40e23bca08e33227_.log 2025-12-04T10:37:48.4700895Z Running 0 items in this shard: 2025-12-04T10:37:48.4701081Z 2025-12-04T10:37:48.4701374Z Finished inductor/test_online_softmax 1/1 ... [2025-12-04 10:37:48.469477][5147.249450242], took 0.10min 2025-12-04T10:37:48.4725777Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_online_softmax/inductor.test_online_softmax-f797910239038b77.xml 2025-12-04T10:37:48.5412651Z Running inductor/test_memory 1/1 ... [2025-12-04 10:37:48.540899][5147.320871452] 2025-12-04T10:37:48.5413097Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:37:48.5416450Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_memory.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:37:48.541247] 2025-12-04T10:37:54.7672857Z 2025-12-04T10:37:54.7673905Z inductor/test_memory 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_memory_1.1_4f8d5ba8b79fb015_.log 2025-12-04T10:37:54.7674622Z Running 0 items in this shard: 2025-12-04T10:37:54.7675135Z 2025-12-04T10:37:54.7675401Z Finished inductor/test_memory 1/1 ... [2025-12-04 10:37:54.766900][5153.546873354], took 0.10min 2025-12-04T10:37:54.7699488Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_memory/inductor.test_memory-380c528ed363230b.xml 2025-12-04T10:37:54.8428189Z Running dynamo/test_streams 1/1 ... [2025-12-04 10:37:54.842464][5153.622436899] 2025-12-04T10:37:54.8428644Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:37:54.8431465Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_streams.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:37:54.842786] 2025-12-04T10:37:58.6147665Z 2025-12-04T10:37:58.6148485Z dynamo/test_streams 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_streams_1.1_d5d585ff08c4417f_.log 2025-12-04T10:37:58.6149198Z Running 0 items in this shard: 2025-12-04T10:37:58.6149376Z 2025-12-04T10:37:58.6149631Z Finished dynamo/test_streams 1/1 ... [2025-12-04 10:37:58.614335][5157.394308172], took 0.06min 2025-12-04T10:37:58.6175379Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_streams/dynamo.test_streams-8a8793e6ce17b0e2.xml 2025-12-04T10:37:58.6425745Z Running inductor/test_unbacked_symints 1/1 ... 
[2025-12-04 10:37:58.642175][5157.422149092] 2025-12-04T10:37:58.6426681Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:37:58.6428186Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_unbacked_symints.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:37:58.642487] 2025-12-04T10:38:05.0695937Z 2025-12-04T10:38:05.0697006Z inductor/test_unbacked_symints 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_unbacked_symints_1.1_0cb6fcdf41989d7a_.log 2025-12-04T10:38:05.0697809Z Running 0 items in this shard: 2025-12-04T10:38:05.0698004Z 2025-12-04T10:38:05.0698310Z Finished inductor/test_unbacked_symints 1/1 ... [2025-12-04 10:38:05.069243][5163.849216746], took 0.11min 2025-12-04T10:38:05.0727454Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_unbacked_symints/inductor.test_unbacked_symints-dfbf01aaa57bc123.xml 2025-12-04T10:38:05.1550171Z Running dynamo/test_aot_compile 1/1 ... [2025-12-04 10:38:05.154611][5163.93458492] 2025-12-04T10:38:05.1550764Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:05.1554187Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_aot_compile.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:38:05.154984] 2025-12-04T10:38:08.7268891Z 2025-12-04T10:38:08.7269978Z dynamo/test_aot_compile 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_aot_compile_1.1_2f925d88d81eb066_.log 2025-12-04T10:38:08.7270702Z Running 0 items in this shard: 2025-12-04T10:38:08.7270903Z 2025-12-04T10:38:08.7271183Z Finished dynamo/test_aot_compile 1/1 ... [2025-12-04 10:38:08.726466][5167.506436279], took 0.06min 2025-12-04T10:38:08.7305656Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_aot_compile/dynamo.test_aot_compile-f9d7f69515583290.xml 2025-12-04T10:38:08.7576595Z Running test_privateuseone_python_backend 1/1 ... [2025-12-04 10:38:08.757307][5167.537280879] 2025-12-04T10:38:08.7577227Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:08.7580689Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_privateuseone_python_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:38:08.757653] 2025-12-04T10:38:11.9789382Z 2025-12-04T10:38:11.9790051Z test_privateuseone_python_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_privateuseone_python_backend_1.1_38648deebb02f255_.log 2025-12-04T10:38:11.9790863Z Running 0 items in this shard: 2025-12-04T10:38:11.9791047Z 2025-12-04T10:38:11.9791361Z Finished test_privateuseone_python_backend 1/1 ... [2025-12-04 10:38:11.978632][5170.75860512], took 0.05min 2025-12-04T10:38:11.9825029Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-7cc4530ad1782b86.xml 2025-12-04T10:38:12.0120237Z Running test_varlen_attention 1/1 ... 
[2025-12-04 10:38:12.011669][5170.791643253] 2025-12-04T10:38:12.0120844Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:12.0123655Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_varlen_attention.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:38:12.011982] 2025-12-04T10:38:15.8342293Z 2025-12-04T10:38:15.8343074Z test_varlen_attention 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_varlen_attention_1.1_ca1d0403b32fd758_.log 2025-12-04T10:38:15.8343792Z Running 0 items in this shard: 2025-12-04T10:38:15.8343978Z 2025-12-04T10:38:15.8344242Z Finished test_varlen_attention 1/1 ... [2025-12-04 10:38:15.833858][5174.613831681], took 0.06min 2025-12-04T10:38:15.8378946Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_varlen_attention/test_varlen_attention-ad0d22c525c989e0.xml 2025-12-04T10:38:15.8657436Z Running test_autograd 1/1 ... [2025-12-04 10:38:15.865377][5174.645351332] 2025-12-04T10:38:15.8657846Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:15.8668939Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_autograd.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:38:15.865739] 2025-12-04T10:38:21.9930688Z 2025-12-04T10:38:21.9931525Z test_autograd 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_autograd_1.1_42a7f311e778fd6d_.log 2025-12-04T10:38:21.9989619Z Running 150 items in this shard: test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, 
test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, 
test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_hessian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, 
test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, 
test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, 
test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradFunctional::test_jacobian_vectorize_raises_no_warnings_logging_tensor, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, 
test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, 
test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, 
test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda, test/test_autograd.py::TestAutogradStreamSynchronizationCUDA::test_side_stream_backward_overlap_cuda 2025-12-04T10:38:22.0048232Z 2025-12-04T10:38:22.0048468Z Finished test_autograd 1/1 ... [2025-12-04 10:38:21.992948][5180.772918466], took 0.10min 2025-12-04T10:38:22.0049318Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_autograd/test_autograd-024c4c739b0a1a21.xml 2025-12-04T10:38:22.0810166Z Running test_ops_fwd_gradients 7/12 ... [2025-12-04 10:38:22.080682][5180.860655887] 2025-12-04T10:38:22.0810619Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:22.0814038Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_fwd_gradients.py', '--shard-id=7', '--num-shards=12', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:38:22.080999] 2025-12-04T10:38:32.3156516Z 2025-12-04T10:38:32.3157528Z test_ops_fwd_gradients 7/12 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_fwd_gradients_7.12_01dcdd6b3df592a7_.log 2025-12-04T10:38:32.3176404Z Running 50 items in this shard: test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, 
test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, 
test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64, test/test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_cholesky_solve_cuda_float64 2025-12-04T10:38:32.3194483Z 2025-12-04T10:38:32.3194749Z Finished test_ops_fwd_gradients 7/12 ... [2025-12-04 10:38:32.315342][5191.095315847], took 0.17min 2025-12-04T10:38:32.3198464Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-b1f2319d42b89977.xml 2025-12-04T10:38:32.4000201Z Running test_ops_gradients 3/10 ... 
[2025-12-04 10:38:32.399670][5191.179643025]
2025-12-04T10:38:32.4000780Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:38:32.4004053Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_gradients.py', '--shard-id=3', '--num-shards=10', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:38:32.400018]
2025-12-04T10:38:46.1419175Z
2025-12-04T10:38:46.1419965Z test_ops_gradients 3/10 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_gradients_3.10_eaaa384ce8d837b0_.log
2025-12-04T10:38:46.1420653Z Running 0 items in this shard:
2025-12-04T10:38:46.1421149Z
2025-12-04T10:38:46.1421410Z Finished test_ops_gradients 3/10 ... [2025-12-04 10:38:46.141555][5204.92152938], took 0.23min
2025-12-04T10:38:46.1460592Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-835aa3d27e77a0f6.xml
2025-12-04T10:38:46.2171573Z Running test_nestedtensor 1/4 ... [2025-12-04 10:38:46.216788][5204.996761759]
2025-12-04T10:38:46.2172012Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:38:46.2175258Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_nestedtensor.py', '--shard-id=1', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:38:46.217127]
2025-12-04T10:38:54.0459201Z
2025-12-04T10:38:54.0459946Z test_nestedtensor 1/4 was successful, full logs can be found in artifacts with path test/test-reports/test_nestedtensor_1.4_e15b8d770f04fe67_.log
2025-12-04T10:38:54.0460647Z Running 0 items in this shard:
2025-12-04T10:38:54.0460852Z
2025-12-04T10:38:54.0461135Z Finished test_nestedtensor 1/4 ...
[2025-12-04 10:38:54.045567][5212.825540516], took 0.13min 2025-12-04T10:38:54.0502048Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_nestedtensor/test_nestedtensor-295bc14f77bb0dc7.xml 2025-12-04T10:38:54.1172651Z Running test_sparse_csr 2/2 ... [2025-12-04 10:38:54.116889][5212.896862903] 2025-12-04T10:38:54.1173076Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:38:54.1177343Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:38:54.117224] 2025-12-04T10:39:07.5576504Z 2025-12-04T10:39:07.5577486Z test_sparse_csr 2/2 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_2.2_8d87173a8ebf96a6_.log 2025-12-04T10:39:07.5591177Z Running 50 items in this shard: test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, 
test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, 
test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32 2025-12-04T10:39:07.5605195Z 2025-12-04T10:39:07.5605461Z Finished test_sparse_csr 2/2 ... [2025-12-04 10:39:07.557304][5226.337275508], took 0.22min 2025-12-04T10:39:07.5630156Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-0efbb8a89000c5fe.xml 2025-12-04T10:39:07.6343733Z Running test_overrides 1/1 ... [2025-12-04 10:39:07.633967][5226.41394008] 2025-12-04T10:39:07.6344159Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:39:07.6347629Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_overrides.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:39:07.634358] 2025-12-04T10:39:14.3615769Z 2025-12-04T10:39:14.3617391Z test_overrides 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_overrides_1.1_7d80d054d5d65ec3_.log 2025-12-04T10:39:14.3687837Z Running 250 items in this shard: test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, 
test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, 
test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, 
test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, 
test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, 
test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, 
test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, 
test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, 
test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, 
test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api 2025-12-04T10:39:14.3752815Z 2025-12-04T10:39:14.3753052Z Finished test_overrides 1/1 ... [2025-12-04 10:39:14.361656][5233.141626661], took 0.11min 2025-12-04T10:39:14.3753923Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_overrides/test_overrides-1eb0f1d69a632627.xml 2025-12-04T10:39:14.4533195Z Running test_torchfuzz_repros 1/1 ... [2025-12-04 10:39:14.452969][5233.232942307] 2025-12-04T10:39:14.4533643Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:39:14.4538447Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_torchfuzz_repros.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:39:14.453348]
2025-12-04T10:39:17.7244228Z
2025-12-04T10:39:17.7245210Z test_torchfuzz_repros 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_torchfuzz_repros_1.1_fcf26854fba7f69d_.log
2025-12-04T10:39:17.7245984Z Running 0 items in this shard:
2025-12-04T10:39:17.7246165Z
2025-12-04T10:39:17.7246437Z Finished test_torchfuzz_repros 1/1 ... [2025-12-04 10:39:17.724106][5236.504078075], took 0.05min
2025-12-04T10:39:17.7292226Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_torchfuzz_repros/test_torchfuzz_repros-1a411881b6ff90d5.xml
2025-12-04T10:39:17.7558309Z Running inductor/test_group_batch_fusion 1/1 ... [2025-12-04 10:39:17.755370][5236.53534448]
2025-12-04T10:39:17.7558990Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:39:17.7561641Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_group_batch_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:39:17.755711]
2025-12-04T10:39:24.0316490Z
2025-12-04T10:39:24.0317663Z inductor/test_group_batch_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_group_batch_fusion_1.1_fc8633d8b0944f89_.log
2025-12-04T10:39:24.0318473Z Running 0 items in this shard:
2025-12-04T10:39:24.0318667Z
2025-12-04T10:39:24.0318964Z Finished inductor/test_group_batch_fusion 1/1 ... [2025-12-04 10:39:24.031295][5242.81126816], took 0.10min
2025-12-04T10:39:24.0365726Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_group_batch_fusion/inductor.test_group_batch_fusion-4467bcde0d834301.xml
2025-12-04T10:39:24.1035509Z Running dynamo/test_dynamic_shapes 1/1 ... [2025-12-04 10:39:24.103201][5242.883174488]
2025-12-04T10:39:24.1036153Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:39:24.1040343Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_dynamic_shapes.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:39:24.103519]
2025-12-04T10:39:35.2893409Z
2025-12-04T10:39:35.2894403Z dynamo/test_dynamic_shapes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_dynamic_shapes_1.1_d62bd541228f238b_.log
2025-12-04T10:39:35.2895498Z Running 0 items in this shard:
2025-12-04T10:39:35.2895680Z
2025-12-04T10:39:35.2895968Z Finished dynamo/test_dynamic_shapes 1/1 ... [2025-12-04 10:39:35.288989][5254.06896324], took 0.19min
2025-12-04T10:39:35.2945571Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_dynamic_shapes/dynamo.test_dynamic_shapes-479e026d710782d7.xml
2025-12-04T10:39:35.3598971Z Running inductor/test_custom_lowering 1/1 ... [2025-12-04 10:39:35.359485][5254.139458391]
2025-12-04T10:39:35.3599637Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:39:35.3602765Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_custom_lowering.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:39:35.359803]
2025-12-04T10:39:41.5856267Z
2025-12-04T10:39:41.5857405Z inductor/test_custom_lowering 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_custom_lowering_1.1_4af35be031667708_.log
2025-12-04T10:39:41.5858175Z Running 0 items in this shard:
2025-12-04T10:39:41.5858360Z
2025-12-04T10:39:41.5858663Z Finished inductor/test_custom_lowering 1/1 ... [2025-12-04 10:39:41.585247][5260.365220617], took 0.10min
2025-12-04T10:39:41.5909228Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_custom_lowering/inductor.test_custom_lowering-5db5aae03b627061.xml
2025-12-04T10:39:41.6611012Z Running inductor/test_perf 1/1 ... [2025-12-04 10:39:41.660718][5260.440691391]
2025-12-04T10:39:41.6611616Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set
2025-12-04T10:39:41.6614318Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_perf.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:39:41.661041]
2025-12-04T10:39:47.9369266Z
2025-12-04T10:39:47.9370634Z inductor/test_perf 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_perf_1.1_4a62740f7bb60b7e_.log
2025-12-04T10:39:47.9371335Z Running 0 items in this shard:
2025-12-04T10:39:47.9371515Z
2025-12-04T10:39:47.9371786Z Finished inductor/test_perf 1/1 ... [2025-12-04 10:39:47.936545][5266.716518764], took 0.10min
2025-12-04T10:39:47.9424451Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_perf/inductor.test_perf-cdb6dc2507d07b9f.xml
2025-12-04T10:39:48.0028090Z Running inductor/test_mkldnn_pattern_matcher 1/2 ...
[2025-12-04 10:39:48.002399][5266.782372606] 2025-12-04T10:39:48.0028713Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:39:48.0031550Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_mkldnn_pattern_matcher.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:39:48.002747] 2025-12-04T11:06:00.1049230Z 2025-12-04T11:06:00.1052480Z inductor/test_mkldnn_pattern_matcher 1/2 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_mkldnn_pattern_matcher_1.2_ef762047374873df_.log 2025-12-04T11:06:00.1076154Z Running 50 items in this shard: test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, 
test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, 
test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, 
test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True, test/inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_qlinear_add_cpu_use_relu_False_is_qat_False_is_dynamic_True 2025-12-04T11:06:00.1099091Z 2025-12-04T11:06:00.1099424Z Finished inductor/test_mkldnn_pattern_matcher 1/2 ... [2025-12-04 11:06:00.104357][6838.884330904], took 26.20min 2025-12-04T11:06:00.1105858Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_mkldnn_pattern_matcher/inductor.test_mkldnn_pattern_matcher-d34f2a11a4dc09d6.xml 2025-12-04T11:06:01.0411201Z Uploading artifacts took 0.83 seconds 2025-12-04T11:06:01.0414307Z Running inductor/test_cpu_cpp_wrapper 1/1 ... 
[2025-12-04 11:06:01.041126][6839.821097584] 2025-12-04T11:06:01.0414782Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:01.0419255Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpu_cpp_wrapper.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:01.041533] 2025-12-04T11:06:08.0704772Z 2025-12-04T11:06:08.0705635Z inductor/test_cpu_cpp_wrapper 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_5bf9de19d872c7a4_.log 2025-12-04T11:06:08.0706318Z 2025-12-04T11:06:08.0706635Z Finished inductor/test_cpu_cpp_wrapper 1/1 ... [2025-12-04 11:06:08.069966][6846.849937104], took 0.12min 2025-12-04T11:06:08.0764751Z Running dynamo/test_deque_reconstruct 1/1 ... [2025-12-04 11:06:08.076152][6846.856127442] 2025-12-04T11:06:08.0765229Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:08.0769791Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_deque_reconstruct.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:08.076508] 2025-12-04T11:06:11.2974648Z 2025-12-04T11:06:11.2976090Z dynamo/test_deque_reconstruct 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_deque_reconstruct_1.1_90d10d742de65891_.log 2025-12-04T11:06:11.2977595Z Running 0 items in this shard: 2025-12-04T11:06:11.2977874Z 2025-12-04T11:06:11.2978352Z Finished dynamo/test_deque_reconstruct 1/1 ... 
[2025-12-04 11:06:11.297205][6850.077174886], took 0.05min 2025-12-04T11:06:11.3037408Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_deque_reconstruct/dynamo.test_deque_reconstruct-a848d884f2f29080.xml 2025-12-04T11:06:11.4062717Z Running inductor/test_utils 1/1 ... [2025-12-04 11:06:11.405851][6850.185823647] 2025-12-04T11:06:11.4063509Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:11.4065547Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:11.406171] 2025-12-04T11:06:14.9773770Z 2025-12-04T11:06:14.9777022Z inductor/test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_utils_1.1_e3f0ace05e84ce51_.log 2025-12-04T11:06:14.9777738Z Running 0 items in this shard: 2025-12-04T11:06:14.9777939Z 2025-12-04T11:06:14.9778200Z Finished inductor/test_utils 1/1 ... [2025-12-04 11:06:14.977026][6853.756998598], took 0.06min 2025-12-04T11:06:14.9837931Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_utils/inductor.test_utils-bd079476913ef2e3.xml 2025-12-04T11:06:15.0152967Z Running inductor/test_indexing 1/1 ... [2025-12-04 11:06:15.014973][6853.794947929] 2025-12-04T11:06:15.0153420Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:15.0157715Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_indexing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:06:15.015323] 2025-12-04T11:06:21.2412299Z 2025-12-04T11:06:21.2413416Z inductor/test_indexing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_indexing_1.1_29ca32c31300a4db_.log 2025-12-04T11:06:21.2414145Z Running 0 items in this shard: 2025-12-04T11:06:21.2414331Z 2025-12-04T11:06:21.2414612Z Finished inductor/test_indexing 1/1 ... [2025-12-04 11:06:21.240918][6860.020890254], took 0.10min 2025-12-04T11:06:21.2473996Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_indexing/inductor.test_indexing-fee711b29eff954c.xml 2025-12-04T11:06:21.3140149Z Running inductor/test_inductor_annotations 1/1 ... [2025-12-04 11:06:21.313624][6860.093597116] 2025-12-04T11:06:21.3140741Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:21.3143959Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inductor_annotations.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:21.313995] 2025-12-04T11:06:27.5397219Z 2025-12-04T11:06:27.5398459Z inductor/test_inductor_annotations 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inductor_annotations_1.1_6c00e9799b0015c9_.log 2025-12-04T11:06:27.5399281Z Running 0 items in this shard: 2025-12-04T11:06:27.5399464Z 2025-12-04T11:06:27.5399891Z Finished inductor/test_inductor_annotations 1/1 ... [2025-12-04 11:06:27.539407][6866.31938065], took 0.10min 2025-12-04T11:06:27.5463168Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_inductor_annotations/inductor.test_inductor_annotations-f4480b8427e264b4.xml 2025-12-04T11:06:27.6339430Z Running inductor/test_compile_worker 1/1 ... 
[2025-12-04 11:06:27.633580][6866.413553237] 2025-12-04T11:06:27.6340154Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:27.6343757Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compile_worker.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:27.633938] 2025-12-04T11:06:33.8597579Z 2025-12-04T11:06:33.8598431Z inductor/test_compile_worker 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_compile_worker_1.1_98dfa7ecc1e7975c_.log 2025-12-04T11:06:33.8599224Z Running 0 items in this shard: 2025-12-04T11:06:33.8599411Z 2025-12-04T11:06:33.8599771Z Finished inductor/test_compile_worker 1/1 ... [2025-12-04 11:06:33.859329][6872.639300814], took 0.10min 2025-12-04T11:06:33.8662552Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_compile_worker/inductor.test_compile_worker-3d3c1537308173d9.xml 2025-12-04T11:06:33.9524821Z Running export/test_serialize 1/1 ... [2025-12-04 11:06:33.952075][6872.732048268] 2025-12-04T11:06:33.9525269Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:33.9528450Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_serialize.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:06:33.952458] 2025-12-04T11:06:40.3778408Z 2025-12-04T11:06:40.3779182Z export/test_serialize 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_serialize_1.1_91837aecdd2957b9_.log 2025-12-04T11:06:40.3779917Z Running 0 items in this shard: 2025-12-04T11:06:40.3780096Z 2025-12-04T11:06:40.3780372Z Finished export/test_serialize 1/1 ... [2025-12-04 11:06:40.377492][6879.157465234], took 0.11min 2025-12-04T11:06:40.3846709Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/export.test_serialize/export.test_serialize-76457f9e334c8675.xml 2025-12-04T11:06:40.4558502Z Running export/test_export_strict 1/1 ... [2025-12-04 11:06:40.455437][6879.235410639] 2025-12-04T11:06:40.4558980Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:40.4561773Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_export_strict.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:40.455755] 2025-12-04T11:06:47.5334069Z 2025-12-04T11:06:47.5335331Z export/test_export_strict 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_export_strict_1.1_be0a1028c7ae3647_.log 2025-12-04T11:06:47.5336092Z Running 0 items in this shard: 2025-12-04T11:06:47.5336296Z 2025-12-04T11:06:47.5336585Z Finished export/test_export_strict 1/1 ... [2025-12-04 11:06:47.533043][6886.313016025], took 0.12min 2025-12-04T11:06:47.5402837Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/export.test_export_strict/export.test_export_strict-3ce7562e53da487b.xml 2025-12-04T11:06:47.6081489Z Running dynamo/test_buffers_override 1/1 ... 
[2025-12-04 11:06:47.607793][6886.387766069] 2025-12-04T11:06:47.6084293Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:47.6085536Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_buffers_override.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:47.608136] 2025-12-04T11:06:50.7792166Z 2025-12-04T11:06:50.7793689Z dynamo/test_buffers_override 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_buffers_override_1.1_2b8279aa55d222a3_.log 2025-12-04T11:06:50.7794828Z Running 0 items in this shard: 2025-12-04T11:06:50.7795014Z 2025-12-04T11:06:50.7795401Z Finished dynamo/test_buffers_override 1/1 ... [2025-12-04 11:06:50.778691][6889.558661111], took 0.05min 2025-12-04T11:06:50.7866932Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_buffers_override/dynamo.test_buffers_override-bcc16daa85498c4b.xml 2025-12-04T11:06:50.8111077Z Running inductor/test_split_cat_fx_passes 1/1 ... [2025-12-04 11:06:50.810785][6889.5907582] 2025-12-04T11:06:50.8111567Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:50.8115278Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_split_cat_fx_passes.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:06:50.811135] 2025-12-04T11:06:57.0372189Z 2025-12-04T11:06:57.0373356Z inductor/test_split_cat_fx_passes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_split_cat_fx_passes_1.1_e5f4e0f1d101e9e4_.log 2025-12-04T11:06:57.0374160Z Running 0 items in this shard: 2025-12-04T11:06:57.0374347Z 2025-12-04T11:06:57.0374889Z Finished inductor/test_split_cat_fx_passes 1/1 ... [2025-12-04 11:06:57.036850][6895.816822468], took 0.10min 2025-12-04T11:06:57.0445836Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_split_cat_fx_passes/inductor.test_split_cat_fx_passes-02502e259211870e.xml 2025-12-04T11:06:57.1040541Z Running inductor/test_cache 1/1 ... [2025-12-04 11:06:57.103636][6895.883609197] 2025-12-04T11:06:57.1041012Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:06:57.1043646Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cache.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:06:57.103969] 2025-12-04T11:07:01.4772744Z 2025-12-04T11:07:01.4774267Z inductor/test_cache 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cache_1.1_eb92e1345f8620a8_.log 2025-12-04T11:07:01.4774988Z Running 0 items in this shard: 2025-12-04T11:07:01.4775169Z 2025-12-04T11:07:01.4775435Z Finished inductor/test_cache 1/1 ... [2025-12-04 11:07:01.476956][6900.256929279], took 0.07min 2025-12-04T11:07:01.4846494Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_cache/inductor.test_cache-258ddb4609e595e9.xml 2025-12-04T11:07:01.5138649Z Running inductor/test_aot_inductor_utils 1/1 ... 
[2025-12-04 11:07:01.513544][6900.293518619] 2025-12-04T11:07:01.5139135Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:01.5142104Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:07:01.513858] 2025-12-04T11:07:07.6903827Z 2025-12-04T11:07:07.6905484Z inductor/test_aot_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_utils_1.1_9a97251034915f58_.log 2025-12-04T11:07:07.6907085Z Running 0 items in this shard: 2025-12-04T11:07:07.6907464Z 2025-12-04T11:07:07.6908066Z Finished inductor/test_aot_inductor_utils 1/1 ... [2025-12-04 11:07:07.689937][6906.469909699], took 0.10min 2025-12-04T11:07:07.6980907Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-fd83be05005cff21.xml 2025-12-04T11:07:07.7990630Z Running inductor/test_control_flow 3/4 ... [2025-12-04 11:07:07.798685][6906.578657664] 2025-12-04T11:07:07.7991107Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:07.7995014Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_control_flow.py', '--shard-id=3', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:07:07.799091] 2025-12-04T11:07:15.3281598Z 2025-12-04T11:07:15.3282598Z inductor/test_control_flow 3/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_control_flow_3.4_c7b62bd639790f0d_.log 2025-12-04T11:07:15.3283375Z Running 0 items in this shard: 2025-12-04T11:07:15.3283564Z 2025-12-04T11:07:15.3283846Z Finished inductor/test_control_flow 3/4 ... [2025-12-04 11:07:15.327770][6914.107742544], took 0.13min 2025-12-04T11:07:15.3359715Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-9924173763b05f6a.xml 2025-12-04T11:07:15.4074857Z Running test_cpp_api_parity 1/1 ... [2025-12-04 11:07:15.407075][6914.18704856] 2025-12-04T11:07:15.4075301Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:15.4078600Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_api_parity.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:07:15.407468] 2025-12-04T11:07:24.9413869Z 2025-12-04T11:07:24.9415070Z test_cpp_api_parity 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_api_parity_1.1_9be5968028a54dc2_.log 2025-12-04T11:07:24.9415761Z Running 0 items in this shard: 2025-12-04T11:07:24.9415939Z 2025-12-04T11:07:24.9416211Z Finished test_cpp_api_parity 1/1 ... [2025-12-04 11:07:24.941062][6923.721035039], took 0.16min 2025-12-04T11:07:24.9503049Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-9bd31207a046932a.xml 2025-12-04T11:07:25.0282144Z Running test_foreach 2/2 ... 
[2025-12-04 11:07:25.027797][6923.807769731] 2025-12-04T11:07:25.0282667Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:25.0285489Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_foreach.py', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:07:25.028177] 2025-12-04T11:07:56.0527588Z 2025-12-04T11:07:56.0528386Z test_foreach 2/2 was successful, full logs can be found in artifacts with path test/test-reports/test_foreach_2.2_7bef0d80f11b45cf_.log 2025-12-04T11:07:56.0605589Z Running 200 items in this shard: test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, 
test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, 
test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, 
test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, 
test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64 2025-12-04T11:07:56.0692453Z 2025-12-04T11:07:56.0692703Z Finished test_foreach 2/2 ... [2025-12-04 11:07:56.052341][6954.832312865], took 0.52min 2025-12-04T11:07:56.0693583Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_foreach/test_foreach-57977e6bb7997bc3.xml 2025-12-04T11:07:56.1428302Z Running nn/test_packed_sequence 1/1 ... 
[2025-12-04 11:07:56.142423][6954.92239562] 2025-12-04T11:07:56.1429151Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:56.1431993Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_packed_sequence.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:07:56.142789] 2025-12-04T11:07:59.4136398Z 2025-12-04T11:07:59.4137152Z nn/test_packed_sequence 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_packed_sequence_1.1_be9e7691d673e5c2_.log 2025-12-04T11:07:59.4137905Z Running 0 items in this shard: 2025-12-04T11:07:59.4138087Z 2025-12-04T11:07:59.4138358Z Finished nn/test_packed_sequence 1/1 ... [2025-12-04 11:07:59.413297][6958.193270371], took 0.05min 2025-12-04T11:07:59.4219970Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_packed_sequence/nn.test_packed_sequence-dcd68b3499aab515.xml 2025-12-04T11:07:59.4556311Z Running test_numa_binding 1/1 ... [2025-12-04 11:07:59.455339][6958.235312445] 2025-12-04T11:07:59.4556739Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:07:59.4560704Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_numa_binding.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:07:59.455717] 2025-12-04T11:08:02.7262566Z 2025-12-04T11:08:02.7263319Z test_numa_binding 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_numa_binding_1.1_c202ae5c6ae9ae6d_.log 2025-12-04T11:08:02.7264015Z Running 0 items in this shard: 2025-12-04T11:08:02.7264194Z 2025-12-04T11:08:02.7264451Z Finished test_numa_binding 1/1 ... 
[2025-12-04 11:08:02.725939][6961.50591165], took 0.05min 2025-12-04T11:08:02.7347513Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_numa_binding/test_numa_binding-62dfaebcfc08487c.xml 2025-12-04T11:08:02.7602614Z Running test_pruning_op 1/1 ... [2025-12-04 11:08:02.759877][6961.539851527] 2025-12-04T11:08:02.7603055Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:02.7606766Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_pruning_op.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:02.760270] 2025-12-04T11:08:06.0314231Z 2025-12-04T11:08:06.0315241Z test_pruning_op 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_pruning_op_1.1_fe9ee687d0fcc012_.log 2025-12-04T11:08:06.0315907Z Running 0 items in this shard: 2025-12-04T11:08:06.0316115Z 2025-12-04T11:08:06.0316354Z Finished test_pruning_op 1/1 ... [2025-12-04 11:08:06.031041][6964.811014711], took 0.05min 2025-12-04T11:08:06.0403006Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_pruning_op/test_pruning_op-d12712c8cd1c0003.xml 2025-12-04T11:08:06.0677576Z Running test_jit_fuser_te 1/1 ... [2025-12-04 11:08:06.067435][6964.847408866] 2025-12-04T11:08:06.0678023Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:06.0681304Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_jit_fuser_te.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:08:06.067761] 2025-12-04T11:08:24.8664407Z 2025-12-04T11:08:24.8665352Z test_jit_fuser_te 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_jit_fuser_te_1.1_fa00452da479ebdd_.log 2025-12-04T11:08:24.8701836Z Running 150 items in this shard: test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, 
test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserStatic::test_torch_to, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, 
test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, 
test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_inlined_optimized_graph, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, 
test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, 
test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check, test/test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check 2025-12-04T11:08:24.8737098Z 2025-12-04T11:08:24.8737370Z Finished test_jit_fuser_te 1/1 ... [2025-12-04 11:08:24.866551][6983.646518893], took 0.31min 2025-12-04T11:08:24.8756814Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_jit_fuser_te/test_jit_fuser_te-0670cf00619133fb.xml 2025-12-04T11:08:24.9667196Z Running optim/test_lrscheduler 1/1 ... [2025-12-04 11:08:24.966329][6983.746301431] 2025-12-04T11:08:24.9667927Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:24.9671270Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_lrscheduler.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:08:24.966668] 2025-12-04T11:08:28.1389380Z 2025-12-04T11:08:28.1390359Z optim/test_lrscheduler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_lrscheduler_1.1_1d3afec10ecbb641_.log 2025-12-04T11:08:28.1391013Z 2025-12-04T11:08:28.1391290Z Finished optim/test_lrscheduler 1/1 ... [2025-12-04 11:08:28.138478][6986.918448877], took 0.05min 2025-12-04T11:08:28.1490421Z Running torch_np/numpy_tests/core/test_indexing 1/1 ... [2025-12-04 11:08:28.148671][6986.928646239] 2025-12-04T11:08:28.1490940Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:28.1493983Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'torch_np/numpy_tests/core/test_indexing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:28.149016] 2025-12-04T11:08:31.5198141Z 2025-12-04T11:08:31.5199550Z torch_np/numpy_tests/core/test_indexing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/torch_np.numpy_tests.core.test_indexing_1.1_d443f17b87365435_.log 2025-12-04T11:08:31.5200811Z Running 0 items in this shard: 2025-12-04T11:08:31.5201032Z 2025-12-04T11:08:31.5201369Z Finished torch_np/numpy_tests/core/test_indexing 1/1 ... [2025-12-04 11:08:31.519405][6990.299378346], took 0.06min 2025-12-04T11:08:31.5288055Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/torch_np.numpy_tests.core.test_indexing/torch_np.numpy_tests.core.test_indexing-f6308bf864aa318d.xml 2025-12-04T11:08:31.6115697Z Running test_futures 1/1 ... 
[2025-12-04 11:08:31.611226][6990.391198687] 2025-12-04T11:08:31.6116205Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:31.6120613Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_futures.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:31.611565] 2025-12-04T11:08:34.8825303Z 2025-12-04T11:08:34.8826304Z test_futures 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_futures_1.1_8685b00496b368c3_.log 2025-12-04T11:08:34.8827169Z Running 0 items in this shard: 2025-12-04T11:08:34.8827354Z 2025-12-04T11:08:34.8827580Z Finished test_futures 1/1 ... [2025-12-04 11:08:34.882176][6993.662148326], took 0.05min 2025-12-04T11:08:34.8916810Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_futures/test_futures-eeccb63775c4add0.xml 2025-12-04T11:08:34.9161324Z Running test_tensor_creation_ops 1/1 ... [2025-12-04 11:08:34.915705][6993.695679037] 2025-12-04T11:08:34.9161779Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:34.9164965Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensor_creation_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:34.916075] 2025-12-04T11:08:39.5900222Z 2025-12-04T11:08:39.5901401Z test_tensor_creation_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensor_creation_ops_1.1_0beed75c1e86aa59_.log 2025-12-04T11:08:39.5902181Z Running 0 items in this shard: 2025-12-04T11:08:39.5902359Z 2025-12-04T11:08:39.5902630Z Finished test_tensor_creation_ops 1/1 ... 
[2025-12-04 11:08:39.589507][6998.369479997], took 0.08min 2025-12-04T11:08:39.5991297Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-d21bc49881a2e2b0.xml 2025-12-04T11:08:39.6259737Z Running test_scaled_matmul_cuda 1/1 ... [2025-12-04 11:08:39.625604][6998.405577638] 2025-12-04T11:08:39.6260243Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:39.6263122Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_scaled_matmul_cuda.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:39.625951] 2025-12-04T11:08:45.0009193Z 2025-12-04T11:08:45.0010218Z test_scaled_matmul_cuda 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_scaled_matmul_cuda_1.1_dada766423d50405_.log 2025-12-04T11:08:45.0024972Z Running 50 items in this shard: test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, 
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, 
test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda, test/test_scaled_matmul_cuda.py::TestFP8MatmulCUDA::test_honor_sm_carveout_cuda 2025-12-04T11:08:45.0039033Z 2025-12-04T11:08:45.0039299Z Finished test_scaled_matmul_cuda 1/1 ... [2025-12-04 11:08:45.000594][7003.780566384], took 0.09min 2025-12-04T11:08:45.0107913Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_scaled_matmul_cuda/test_scaled_matmul_cuda-d9bde1d5292f9755.xml 2025-12-04T11:08:45.0467994Z Running torch_np/numpy_tests/core/test_shape_base 1/1 ... 
[2025-12-04 11:08:45.046358][7003.826332248] 2025-12-04T11:08:45.0468510Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:45.0470991Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'torch_np/numpy_tests/core/test_shape_base.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:45.046719] 2025-12-04T11:08:48.5684050Z 2025-12-04T11:08:48.5685077Z torch_np/numpy_tests/core/test_shape_base 1/1 was successful, full logs can be found in artifacts with path test/test-reports/torch_np.numpy_tests.core.test_shape_base_1.1_9d39f3ab13cf756c_.log 2025-12-04T11:08:48.5685959Z Running 0 items in this shard: 2025-12-04T11:08:48.5686143Z 2025-12-04T11:08:48.5686738Z Finished torch_np/numpy_tests/core/test_shape_base 1/1 ... [2025-12-04 11:08:48.568023][7007.34799614], took 0.06min 2025-12-04T11:08:48.5781228Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/torch_np.numpy_tests.core.test_shape_base/torch_np.numpy_tests.core.test_shape_base-5523c0bbcadda606.xml 2025-12-04T11:08:48.6025827Z Running test_vulkan 1/1 ... [2025-12-04 11:08:48.602246][7007.382219841] 2025-12-04T11:08:48.6026379Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:48.6029532Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_vulkan.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 11:08:48.602568] 2025-12-04T11:08:51.8233985Z 2025-12-04T11:08:51.8234767Z test_vulkan 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_vulkan_1.1_a3b26789c3daa3f3_.log 2025-12-04T11:08:51.8235738Z Running 0 items in this shard: 2025-12-04T11:08:51.8235998Z 2025-12-04T11:08:51.8236337Z Finished test_vulkan 1/1 ... [2025-12-04 11:08:51.823067][7010.603039887], took 0.05min 2025-12-04T11:08:51.8344075Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_vulkan/test_vulkan-b9421a80edb6f140.xml 2025-12-04T11:08:51.8609579Z Running lazy/test_generator 1/1 ... [2025-12-04 11:08:51.860650][7010.640624001] 2025-12-04T11:08:51.8610020Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:51.8613735Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_generator.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:51.861019] 2025-12-04T11:08:55.1318065Z 2025-12-04T11:08:55.1318913Z lazy/test_generator 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_generator_1.1_2f4441dec655dc71_.log 2025-12-04T11:08:55.1319685Z Running 0 items in this shard: 2025-12-04T11:08:55.1319885Z 2025-12-04T11:08:55.1320144Z Finished lazy/test_generator 1/1 ... [2025-12-04 11:08:55.131455][7013.911428648], took 0.05min 2025-12-04T11:08:55.1417013Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/lazy.test_generator/lazy.test_generator-6425516431ce47a1.xml 2025-12-04T11:08:55.1653280Z Running nn/test_convolution 1/2 ... 
[2025-12-04 11:08:55.164952][7013.944926156] 2025-12-04T11:08:55.1653725Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T11:08:55.1657199Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_convolution.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 11:08:55.165294] 2025-12-04T11:09:00.3393440Z 2025-12-04T11:09:00.3394657Z nn/test_convolution 1/2 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_convolution_1.2_1888800342d74d79_.log 2025-12-04T11:09:00.3395364Z Running 0 items in this shard: 2025-12-04T11:09:00.3395550Z 2025-12-04T11:09:00.3395805Z Finished nn/test_convolution 1/2 ... [2025-12-04 11:09:00.338969][7019.118941564], took 0.09min 2025-12-04T11:09:00.3494695Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-b82571327b265a63.xml 2025-12-04T11:09:04.1489057Z Running test batch 'tests to run' cost 6085.0 seconds 2025-12-04T11:09:04.1502506Z Emitting td_test_failure_stats_v2 2025-12-04T11:09:04.1506469Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764846544_a4f9f6d2d10111f0a5f90242ac110002 2025-12-04T11:09:04.2663278Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764846544_a4f9f6d2d10111f0a5f90242ac110002 2025-12-04T11:09:04.2664003Z inductor/test_pattern_matcher 1/1 failed! 
2025-12-04T11:09:05.0016316Z 2025-12-04T11:09:05.0016681Z real 101m31.117s 2025-12-04T11:09:05.0017013Z user 115m22.888s 2025-12-04T11:09:05.0017327Z sys 34m58.264s 2025-12-04T11:09:05.0017628Z + assert_git_not_dirty 2025-12-04T11:09:05.0018340Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *rocm* ]] 2025-12-04T11:09:05.0018980Z + [[ linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck != *xla* ]] 2025-12-04T11:09:05.0025166Z ++ git status --porcelain 2025-12-04T11:09:05.0025994Z ++ grep -v '?? third_party' 2025-12-04T11:09:09.2821952Z ++ true 2025-12-04T11:09:09.2824066Z + git_status= 2025-12-04T11:09:09.2824496Z + [[ -n '' ]] 2025-12-04T11:09:09.2826525Z + sccache_epilogue 2025-12-04T11:09:09.2826869Z + echo '::group::Sccache Compilation Log' 2025-12-04T11:09:09.2827557Z ##[group]Sccache Compilation Log 2025-12-04T11:09:09.2827898Z + echo '=================== sccache compilation log ===================' 2025-12-04T11:09:09.2828287Z =================== sccache compilation log =================== 2025-12-04T11:09:09.2828893Z + python /var/lib/jenkins/workspace/.ci/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 2025-12-04T11:09:09.2972989Z + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 2025-12-04T11:09:09.2973746Z =========== If your build fails, please take a look at the log above for possible reasons =========== 2025-12-04T11:09:09.2974197Z + sccache --show-stats 2025-12-04T11:09:09.3002333Z Compile requests 5515 2025-12-04T11:09:09.3002769Z Compile requests executed 151 2025-12-04T11:09:09.3003196Z Cache hits 58 2025-12-04T11:09:09.3003825Z Cache hits (C/C++) 58 2025-12-04T11:09:09.3004227Z Cache misses 93 2025-12-04T11:09:09.3004540Z Cache misses (C/C++) 93 2025-12-04T11:09:09.3004851Z Cache hits rate 38.41 % 2025-12-04T11:09:09.3005145Z Cache hits rate (C/C++) 38.41 % 2025-12-04T11:09:09.3005437Z Cache timeouts 0 2025-12-04T11:09:09.3005741Z Cache read errors 0 
2025-12-04T11:09:09.3006037Z Forced recaches 0 2025-12-04T11:09:09.3006325Z Cache write errors 0 2025-12-04T11:09:09.3006606Z Cache errors 0 2025-12-04T11:09:09.3006889Z Compilations 93 2025-12-04T11:09:09.3007182Z Compilation failures 0 2025-12-04T11:09:09.3007479Z Non-cacheable compilations 0 2025-12-04T11:09:09.3007782Z Non-cacheable calls 403 2025-12-04T11:09:09.3008086Z Non-compilation calls 4961 2025-12-04T11:09:09.3008389Z Unsupported compiler calls 0 2025-12-04T11:09:09.3008692Z Average cache write 0.042 s 2025-12-04T11:09:09.3009049Z Average compiler 7.425 s 2025-12-04T11:09:09.3009545Z Average cache read hit 0.025 s 2025-12-04T11:09:09.3009975Z Failed distributed compilations 0 2025-12-04T11:09:09.3010266Z 2025-12-04T11:09:09.3010503Z Non-cacheable reasons: 2025-12-04T11:09:09.3010754Z -E 353 2025-12-04T11:09:09.3011042Z unknown source language 50 2025-12-04T11:09:09.3011244Z 2025-12-04T11:09:09.3011466Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T11:09:09.3011885Z Version (client) 0.10.0 2025-12-04T11:09:09.3012178Z + sccache --stop-server 2025-12-04T11:09:09.3033090Z Stopping sccache server... 
2025-12-04T11:09:09.3037184Z Compile requests 5515 2025-12-04T11:09:09.3037661Z Compile requests executed 151 2025-12-04T11:09:09.3038086Z Cache hits 58 2025-12-04T11:09:09.3038474Z Cache hits (C/C++) 58 2025-12-04T11:09:09.3038821Z Cache misses 93 2025-12-04T11:09:09.3039244Z Cache misses (C/C++) 93 2025-12-04T11:09:09.3039654Z Cache hits rate 38.41 % 2025-12-04T11:09:09.3039965Z Cache hits rate (C/C++) 38.41 % 2025-12-04T11:09:09.3040263Z Cache timeouts 0 2025-12-04T11:09:09.3040545Z Cache read errors 0 2025-12-04T11:09:09.3040927Z Forced recaches 0 2025-12-04T11:09:09.3041212Z Cache write errors 0 2025-12-04T11:09:09.3041490Z Cache errors 0 2025-12-04T11:09:09.3041774Z Compilations 93 2025-12-04T11:09:09.3042064Z Compilation failures 0 2025-12-04T11:09:09.3042367Z Non-cacheable compilations 0 2025-12-04T11:09:09.3042667Z Non-cacheable calls 403 2025-12-04T11:09:09.3042962Z Non-compilation calls 4961 2025-12-04T11:09:09.3043283Z Unsupported compiler calls 0 2025-12-04T11:09:09.3043587Z Average cache write 0.042 s 2025-12-04T11:09:09.3043894Z Average compiler 7.425 s 2025-12-04T11:09:09.3044194Z Average cache read hit 0.025 s 2025-12-04T11:09:09.3044594Z Failed distributed compilations 0 2025-12-04T11:09:09.3044861Z 2025-12-04T11:09:09.3044957Z Non-cacheable reasons: 2025-12-04T11:09:09.3045197Z -E 353 2025-12-04T11:09:09.3045495Z unknown source language 50 2025-12-04T11:09:09.3045689Z 2025-12-04T11:09:09.3045907Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T11:09:09.3046328Z Version (client) 0.10.0 2025-12-04T11:09:09.3046631Z + echo ::endgroup:: 2025-12-04T11:09:09.3047061Z ##[endgroup] 2025-12-04T11:09:09.3047334Z + cleanup_workspace 2025-12-04T11:09:09.3047797Z + echo 'sudo may print the following warning message that can be ignored. The chown command will still run.' 2025-12-04T11:09:09.3048520Z sudo may print the following warning message that can be ignored. The chown command will still run. 
2025-12-04T11:09:09.3049103Z + echo ' sudo: setrlimit(RLIMIT_STACK): Operation not permitted' 2025-12-04T11:09:09.3049622Z sudo: setrlimit(RLIMIT_STACK): Operation not permitted 2025-12-04T11:09:09.3050159Z + echo 'For more details refer to https://github.com/sudo-project/sudo/issues/42' 2025-12-04T11:09:09.3050712Z For more details refer to https://github.com/sudo-project/sudo/issues/42 2025-12-04T11:09:09.3051211Z + sudo chown -R 1000 /var/lib/jenkins/workspace 2025-12-04T11:09:10.3814043Z ##[group]Run pytorch/test-infra/.github/actions/upload-benchmark-results@main 2025-12-04T11:09:10.3814498Z with: 2025-12-04T11:09:10.3814739Z benchmark-results-dir: test/test-reports 2025-12-04T11:09:10.3815049Z dry-run: false 2025-12-04T11:09:10.3815278Z schema-version: v3 2025-12-04T11:09:10.3815692Z github-token: *** 2025-12-04T11:09:10.3815912Z env: 2025-12-04T11:09:10.3816112Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:09:10.3816366Z HAS_NVIDIA_GPU: true 2025-12-04T11:09:10.3816665Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:09:10.3817177Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:09:10.3817640Z ##[endgroup] 2025-12-04T11:09:10.3845641Z ##[group]Run set -eux 2025-12-04T11:09:10.3845906Z set -eux 2025-12-04T11:09:10.3846122Z  2025-12-04T11:09:10.3846338Z if [[ -n "" ]]; then 2025-12-04T11:09:10.3846590Z  source "" 2025-12-04T11:09:10.3846841Z fi 2025-12-04T11:09:10.3847175Z python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T11:09:10.3847570Z  2025-12-04T11:09:10.3847779Z DEVICE_NAME="" 2025-12-04T11:09:10.3848032Z DEVICE_TYPE="" 2025-12-04T11:09:10.3848261Z  2025-12-04T11:09:10.3848482Z if command -v nvidia-smi; then 2025-12-04T11:09:10.3848927Z  # NB: I'm using PyTorch here to get the device name, however, it needs to 2025-12-04T11:09:10.3849487Z  # install the correct version of PyTorch manually for now. 
Any PyTorch 2025-12-04T11:09:10.3849996Z  # version is fine, I just use 2.7.1 to satify PYPIDEP linter 2025-12-04T11:09:10.3850399Z  python3 -mpip install torch==2.7.1 2025-12-04T11:09:10.3850724Z elif command -v rocminfo; then 2025-12-04T11:09:10.3851237Z  # NB: Installing torch on ROCm runner with pip here causes CI to fail 2025-12-04T11:09:10.3851764Z  # with a memoryview is too large error only on MI300 runners. Is pip 2025-12-04T11:09:10.3852276Z  # version on ROCm runner there too old? As a workaround, let's use the 2025-12-04T11:09:10.3852733Z  # GPU device name coming from rocminfo instead 2025-12-04T11:09:10.3853079Z  DEVICE_NAME=rocm 2025-12-04T11:09:10.3853532Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2025-12-04T11:09:10.3853992Z fi 2025-12-04T11:09:10.3854190Z  2025-12-04T11:09:10.3854448Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2025-12-04T11:09:10.3854833Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2025-12-04T11:09:10.3869196Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:09:10.3869552Z env: 2025-12-04T11:09:10.3869763Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:09:10.3870024Z HAS_NVIDIA_GPU: true 2025-12-04T11:09:10.3870330Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:09:10.3870844Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:09:10.3871303Z ##[endgroup] 2025-12-04T11:09:10.3914570Z + [[ -n '' ]] 2025-12-04T11:09:10.3915623Z + python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T11:09:10.6355132Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T11:09:11.9111475Z Collecting boto3==1.35.33 2025-12-04T11:09:11.9283550Z Downloading boto3-1.35.33-py3-none-any.whl (139 kB) 2025-12-04T11:09:12.2932448Z Collecting psutil==7.0.0 2025-12-04T11:09:12.2971792Z Downloading 
psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2025-12-04T11:09:12.3274699Z Collecting pynvml==12.0.0 2025-12-04T11:09:12.3303589Z Downloading pynvml-12.0.0-py3-none-any.whl (26 kB) 2025-12-04T11:09:13.6761851Z Collecting botocore<1.36.0,>=1.35.33 2025-12-04T11:09:13.6795082Z Downloading botocore-1.35.99-py3-none-any.whl (13.3 MB) 2025-12-04T11:09:13.8188591Z Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /usr/lib/python3.9/site-packages (from boto3==1.35.33) (0.10.0) 2025-12-04T11:09:13.8645992Z Collecting s3transfer<0.11.0,>=0.10.0 2025-12-04T11:09:13.8682483Z Downloading s3transfer-0.10.4-py3-none-any.whl (83 kB) 2025-12-04T11:09:13.9196233Z Collecting nvidia-ml-py<13.0.0a0,>=12.0.0 2025-12-04T11:09:13.9224955Z Downloading nvidia_ml_py-12.575.51-py3-none-any.whl (47 kB) 2025-12-04T11:09:13.9323579Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (2.8.1) 2025-12-04T11:09:13.9332957Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.25.10) 2025-12-04T11:09:14.0922544Z Requirement already satisfied: six>=1.5 in /usr/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.15.0) 2025-12-04T11:09:14.2306432Z Installing collected packages: botocore, s3transfer, nvidia-ml-py, pynvml, psutil, boto3 2025-12-04T11:09:14.8001335Z Attempting uninstall: nvidia-ml-py 2025-12-04T11:09:14.8003247Z Found existing installation: nvidia-ml-py 11.525.84 2025-12-04T11:09:14.8017707Z Uninstalling nvidia-ml-py-11.525.84: 2025-12-04T11:09:14.8270804Z Successfully uninstalled nvidia-ml-py-11.525.84 2025-12-04T11:09:14.8883217Z Attempting uninstall: psutil 2025-12-04T11:09:14.8883843Z Found existing installation: psutil 5.9.8 2025-12-04T11:09:14.8974360Z Uninstalling 
psutil-5.9.8: 2025-12-04T11:09:14.8981849Z Successfully uninstalled psutil-5.9.8 2025-12-04T11:09:15.0682722Z Successfully installed boto3-1.35.33 botocore-1.35.99 nvidia-ml-py-12.575.51 psutil-7.0.0 pynvml-12.0.0 s3transfer-0.10.4 2025-12-04T11:09:15.1970780Z + DEVICE_NAME= 2025-12-04T11:09:15.1971692Z + DEVICE_TYPE= 2025-12-04T11:09:15.1971988Z + command -v nvidia-smi 2025-12-04T11:09:15.1972302Z + python3 -mpip install torch==2.7.1 2025-12-04T11:09:15.1972600Z /usr/bin/nvidia-smi 2025-12-04T11:09:15.4344990Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T11:09:15.7368314Z Collecting torch==2.7.1 2025-12-04T11:09:15.7604263Z Downloading torch-2.7.1-cp39-cp39-manylinux_2_28_x86_64.whl (821.1 MB) 2025-12-04T11:09:27.6162259Z Collecting nvidia-cuda-cupti-cu12==12.6.80 2025-12-04T11:09:27.6234261Z Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.9 MB) 2025-12-04T11:09:27.7274343Z Collecting nvidia-cusparselt-cu12==0.6.3 2025-12-04T11:09:27.7311231Z Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-12-04T11:09:29.3178234Z Collecting nvidia-cufft-cu12==11.3.0.4 2025-12-04T11:09:29.3243956Z Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB) 2025-12-04T11:09:31.5470761Z Collecting nvidia-nvtx-cu12==12.6.77 2025-12-04T11:09:31.5509387Z Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-12-04T11:09:31.5841809Z Collecting nvidia-curand-cu12==10.3.7.77 2025-12-04T11:09:31.5908359Z Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (56.3 MB) 2025-12-04T11:09:32.1697872Z Collecting sympy>=1.13.3 2025-12-04T11:09:32.1733330Z Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB) 2025-12-04T11:09:32.2705085Z Collecting nvidia-cuda-runtime-cu12==12.6.77 2025-12-04T11:09:32.2768947Z Downloading 
nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB) 2025-12-04T11:09:32.3197364Z Collecting nvidia-nvjitlink-cu12==12.6.85 2025-12-04T11:09:32.3349297Z Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-12-04T11:09:32.5586014Z Collecting nvidia-nccl-cu12==2.26.2 2025-12-04T11:09:32.5661956Z Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-12-04T11:09:34.8264047Z Collecting filelock 2025-12-04T11:09:34.8300227Z Downloading filelock-3.19.1-py3-none-any.whl (15 kB) 2025-12-04T11:09:34.8624389Z Collecting nvidia-cublas-cu12==12.6.4.1 2025-12-04T11:09:34.8696081Z Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-12-04T11:09:40.3029953Z Collecting networkx 2025-12-04T11:09:40.3069274Z Downloading networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-12-04T11:09:40.3276785Z Requirement already satisfied: jinja2 in /usr/lib/python3.9/site-packages (from torch==2.7.1) (2.11.3) 2025-12-04T11:09:40.3583165Z Collecting nvidia-cusparse-cu12==12.5.4.2 2025-12-04T11:09:40.3645734Z Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (216.6 MB) 2025-12-04T11:09:42.7904922Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 2025-12-04T11:09:42.7992325Z Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-12-04T11:09:43.0515990Z Collecting nvidia-cusolver-cu12==11.7.1.2 2025-12-04T11:09:43.0576691Z Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (158.2 MB) 2025-12-04T11:09:44.6979273Z Collecting fsspec 2025-12-04T11:09:44.7017885Z Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB) 2025-12-04T11:09:44.7071363Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/.local/lib/python3.9/site-packages (from torch==2.7.1) 
(4.15.0) 2025-12-04T11:09:44.7543598Z Collecting triton==3.3.1 2025-12-04T11:09:44.7612281Z Downloading triton-3.3.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.6 MB) 2025-12-04T11:09:46.3246946Z Collecting nvidia-cudnn-cu12==9.5.1.17 2025-12-04T11:09:46.3312117Z Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-12-04T11:09:54.3189665Z Collecting nvidia-cufile-cu12==1.11.1.6 2025-12-04T11:09:54.3259553Z Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-12-04T11:09:54.3688319Z Requirement already satisfied: setuptools>=40.8.0 in /usr/lib/python3.9/site-packages (from triton==3.3.1->torch==2.7.1) (59.6.0) 2025-12-04T11:09:54.4007521Z Collecting mpmath<1.4,>=1.1.0 2025-12-04T11:09:54.4046479Z Downloading mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-12-04T11:09:54.4998625Z Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib64/python3.9/site-packages (from jinja2->torch==2.7.1) (1.1.1) 2025-12-04T11:09:54.8559357Z Installing collected packages: nvidia-nvjitlink-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, mpmath, triton, sympy, nvidia-nvtx-cu12, nvidia-nccl-cu12, nvidia-cusparselt-cu12, nvidia-cusolver-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, networkx, fsspec, filelock, torch 2025-12-04T11:10:04.2209259Z WARNING: The scripts proton and proton-viewer are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T11:10:04.2210117Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T11:10:08.3028321Z WARNING: The script isympy is installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T11:10:08.3029361Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 
2025-12-04T11:10:39.6472482Z WARNING: The scripts torchfrtrace and torchrun are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T11:10:39.6473338Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T11:10:39.7445326Z Successfully installed filelock-3.19.1 fsspec-2025.10.0 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 sympy-1.14.0 torch-2.7.1 triton-3.3.1 2025-12-04T11:10:40.4285875Z + echo DEVICE_NAME= 2025-12-04T11:10:40.4287337Z + echo DEVICE_TYPE= 2025-12-04T11:10:40.4316605Z ##[group]Run set -eux 2025-12-04T11:10:40.4316858Z set -eux 2025-12-04T11:10:40.4317079Z  2025-12-04T11:10:40.4317314Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2025-12-04T11:10:40.4317647Z  echo "Missing github-token input" 2025-12-04T11:10:40.4317956Z  exit 1 2025-12-04T11:10:40.4318172Z fi 2025-12-04T11:10:40.4328341Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:40.4328702Z env: 2025-12-04T11:10:40.4328913Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:40.4329174Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:40.4329528Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:40.4330047Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:40.4330510Z DEVICE_NAME: 2025-12-04T11:10:40.4330720Z DEVICE_TYPE: 2025-12-04T11:10:40.4331122Z GITHUB_TOKEN: *** 2025-12-04T11:10:40.4331345Z ##[endgroup] 2025-12-04T11:10:40.4366319Z + [[ -z *** ]] 2025-12-04T11:10:40.4421922Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 
2025-12-04T11:10:40.4422434Z with: 2025-12-04T11:10:40.4422778Z github-token: *** 2025-12-04T11:10:40.4423014Z env: 2025-12-04T11:10:40.4423221Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:40.4423480Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:40.4423786Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:40.4424446Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:40.4424925Z DEVICE_NAME: 2025-12-04T11:10:40.4425145Z DEVICE_TYPE: 2025-12-04T11:10:40.4425351Z ##[endgroup] 2025-12-04T11:10:40.4440353Z ##[group]Run set -eux 2025-12-04T11:10:40.4440609Z set -eux 2025-12-04T11:10:40.4440827Z  2025-12-04T11:10:40.4441340Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T11:10:40.4450790Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:40.4451145Z env: 2025-12-04T11:10:40.4451351Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:40.4451605Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:40.4451897Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:40.4452441Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:40.4452923Z DEVICE_NAME: 2025-12-04T11:10:40.4453139Z DEVICE_TYPE: 2025-12-04T11:10:40.4453490Z GITHUB_TOKEN: *** 2025-12-04T11:10:40.4453711Z ##[endgroup] 2025-12-04T11:10:40.4484977Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 19922826259 i-0513695dee1ce902e 2025-12-04T11:10:43.2155131Z setting job-id=57118183207 2025-12-04T11:10:43.2156152Z setting job-name=linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T11:10:43.2267495Z ##[group]Run set -eux 2025-12-04T11:10:43.2267767Z set -eux 2025-12-04T11:10:43.2267981Z  
2025-12-04T11:10:43.2268200Z if [[ -n "" ]]; then 2025-12-04T11:10:43.2268468Z  source "" 2025-12-04T11:10:43.2268699Z fi 2025-12-04T11:10:43.2268906Z  2025-12-04T11:10:43.2269282Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2025-12-04T11:10:43.2269784Z  --schema-version "${SCHEMA_VERSION}" \ 2025-12-04T11:10:43.2270306Z  --repo "${REPO}" \ 2025-12-04T11:10:43.2270606Z  --head-branch "${HEAD_BRANCH}" \ 2025-12-04T11:10:43.2270930Z  --head-sha "${HEAD_SHA}" \ 2025-12-04T11:10:43.2271258Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2025-12-04T11:10:43.2271611Z  --run-attempt "${RUN_ATTEMPT}" \ 2025-12-04T11:10:43.2271944Z  --job-id "${JOB_ID}" \ 2025-12-04T11:10:43.2272242Z  --job-name "${JOB_NAME}" 2025-12-04T11:10:43.2281369Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:43.2281715Z env: 2025-12-04T11:10:43.2281925Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:43.2282188Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:43.2282486Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:43.2283127Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:43.2283707Z DEVICE_NAME: 2025-12-04T11:10:43.2283929Z DEVICE_TYPE: 2025-12-04T11:10:43.2284161Z SCHEMA_VERSION: v3 2025-12-04T11:10:43.2284419Z REPO: pytorch/pytorch 2025-12-04T11:10:43.2284688Z HEAD_BRANCH: refs/heads/main 2025-12-04T11:10:43.2285024Z HEAD_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T11:10:43.2285392Z WORKFLOW_RUN_ID: 19922826259 2025-12-04T11:10:43.2285668Z RUN_ATTEMPT: 1 2025-12-04T11:10:43.2285894Z JOB_ID: 57118183207 2025-12-04T11:10:43.2286639Z JOB_NAME: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T11:10:43.2287445Z ##[endgroup] 2025-12-04T11:10:43.2319215Z + [[ -n '' ]] 2025-12-04T11:10:43.2321681Z + python3 
/home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/pytorch --head-branch refs/heads/main --head-sha ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 --workflow-id 19922826259 --run-attempt 1 --job-id 57118183207 --job-name 'linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests)' 2025-12-04T11:10:43.2729460Z ##[group]Run set -eux 2025-12-04T11:10:43.2729727Z set -eux 2025-12-04T11:10:43.2729961Z  2025-12-04T11:10:43.2730167Z if [[ -n "" ]]; then 2025-12-04T11:10:43.2730430Z  source "" 2025-12-04T11:10:43.2730660Z fi 2025-12-04T11:10:43.2730862Z  2025-12-04T11:10:43.2731234Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2025-12-04T11:10:43.2740541Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:43.2740897Z env: 2025-12-04T11:10:43.2741099Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:43.2741356Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:43.2741660Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:43.2742179Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:43.2742652Z DEVICE_NAME: 2025-12-04T11:10:43.2742870Z DEVICE_TYPE: 2025-12-04T11:10:43.2743110Z ##[endgroup] 2025-12-04T11:10:43.2773651Z + [[ -n '' ]] 2025-12-04T11:10:43.2774572Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_runners_info.py 2025-12-04T11:10:44.2443561Z /home/ec2-user/.local/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) 
2025-12-04T11:10:44.2444721Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T11:10:45.2071177Z ##[group]Run set -eux 2025-12-04T11:10:45.2071433Z set -eux 2025-12-04T11:10:45.2071652Z  2025-12-04T11:10:45.2071904Z # TODO (huydhn): Implement this part 2025-12-04T11:10:45.2072267Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2025-12-04T11:10:45.2083059Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:45.2083420Z env: 2025-12-04T11:10:45.2083622Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:45.2083884Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:45.2084236Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:45.2084752Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:45.2085209Z DEVICE_NAME: 2025-12-04T11:10:45.2085424Z DEVICE_TYPE: 2025-12-04T11:10:45.2085634Z ##[endgroup] 2025-12-04T11:10:45.2116985Z + echo 'dependencies={}' 2025-12-04T11:10:45.2160630Z ##[group]Run set -eux 2025-12-04T11:10:45.2160903Z set -eux 2025-12-04T11:10:45.2161117Z  2025-12-04T11:10:45.2161332Z if [[ -n "" ]]; then 2025-12-04T11:10:45.2161595Z  source "" 2025-12-04T11:10:45.2161826Z fi 2025-12-04T11:10:45.2162036Z  2025-12-04T11:10:45.2162296Z if [[ ! 
-d "${BENCHMARK_RESULTS_DIR}" ]]; then 2025-12-04T11:10:45.2162705Z  echo "${BENCHMARK_RESULTS_DIR} does not exist, skipping" 2025-12-04T11:10:45.2163169Z  # We don't want the job to fail if the directory doesn't exist 2025-12-04T11:10:45.2163538Z  exit 0 2025-12-04T11:10:45.2163777Z fi 2025-12-04T11:10:45.2164013Z  2025-12-04T11:10:45.2164249Z if [[ "${DRY_RUN}" == "true" ]]; then 2025-12-04T11:10:45.2164700Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T11:10:45.2165230Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T11:10:45.2165649Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T11:10:45.2165989Z  --runners "${RUNNER_INFO}" \ 2025-12-04T11:10:45.2166322Z  --dependencies "${DEPENDENCIES}" \ 2025-12-04T11:10:45.2166754Z  --dry-run 2025-12-04T11:10:45.2166994Z else 2025-12-04T11:10:45.2167356Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T11:10:45.2167869Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T11:10:45.2168267Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T11:10:45.2168611Z  --runners "${RUNNER_INFO}" \ 2025-12-04T11:10:45.2168932Z  --dependencies "${DEPENDENCIES}" 2025-12-04T11:10:45.2169238Z fi 2025-12-04T11:10:45.2177534Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:45.2177892Z env: 2025-12-04T11:10:45.2178094Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:45.2178351Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:45.2178651Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:45.2179165Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:45.2179635Z DEVICE_NAME: 2025-12-04T11:10:45.2179850Z DEVICE_TYPE: 2025-12-04T11:10:45.2180092Z BENCHMARK_RESULTS_DIR: test/test-reports 2025-12-04T11:10:45.2180386Z DRY_RUN: false 2025-12-04T11:10:45.2181857Z BENCHMARK_METADATA: {"timestamp": 1764846643, "schema_version": "v3", "name": 
"linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests)", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922826259, "run_attempt": 1, "job_id": 57118183207} 2025-12-04T11:10:45.2183940Z RUNNER_INFO: [{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 62, "extra_info": {"hostname": "ip-10-0-37-220.ec2.internal"}, "name": "cuda", "type": "NVIDIA A10G", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}] 2025-12-04T11:10:45.2184692Z DEPENDENCIES: {} 2025-12-04T11:10:45.2184912Z ##[endgroup] 2025-12-04T11:10:45.2213680Z + [[ -n '' ]] 2025-12-04T11:10:45.2213953Z + [[ ! -d test/test-reports ]] 2025-12-04T11:10:45.2214227Z + [[ false == \t\r\u\e ]] 2025-12-04T11:10:45.2217353Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py --benchmark-results-dir test/test-reports --metadata '{"timestamp": 1764846643, "schema_version": "v3", "name": "linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests)", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922826259, "run_attempt": 1, "job_id": 57118183207}' --runners '[{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 62, "extra_info": {"hostname": "ip-10-0-37-220.ec2.internal"}, "name": "cuda", "type": "NVIDIA A10G", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}]' --dependencies '{}' 2025-12-04T11:10:45.3818604Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'inductor/test_aot_inductor'}, {'test_file': 
'inductor/test_torchinductor'}, {'test_file': 'inductor/test_torchinductor_dynamic_shapes'}, {'test_file': 'inductor/test_torchinductor_codegen_dynamic_shapes'}, {'test_file': 'inductor/test_kernel_benchmark'}, {'test_file': 'inductor/test_torchinductor_opinfo'}, {'test_file': 'inductor/test_pattern_matcher'}, {'test_file': 'inductor/test_cuda_repro'}, {'test_file': 'inductor/test_cudagraph_trees'}, {'test_file': 'dynamo/test_activation_checkpointing'}, {'test_file': 'dynamo/test_logging'}, {'test_file': 'dynamo/test_repros'}, {'test_file': 'inductor/test_flex_attention'}, {'test_file': 'inductor/test_cuda_select_algorithm'}, {'test_file': 'inductor/test_compile_subprocess'}, {'test_file': 'inductor/test_flex_decoding'}, {'test_file': 'inductor/test_deterministic'}, {'test_file': 'export/test_retraceability'}, {'test_file': 'inductor/test_fp8'}, {'test_file': 'dynamo/test_model_output'}, {'test_file': 'inductor/test_triton_kernels'}, {'test_file': 'inductor/test_extension_backend'}, {'test_file': 'inductor/test_native_matmul'}, {'test_file': 'inductor/test_loop_ordering'}, {'test_file': 'export/test_serdes'}, {'test_file': 'dynamo/test_regional_inductor'}, {'test_file': 'dynamo/test_fx_graph_runnable'}, {'test_file': 'dynamo/test_backends'}, {'test_file': 'inductor/test_aot_inductor_package'}, {'test_file': 'inductor/test_decompose_mem_bound_mm'}, {'test_file': 'inductor/test_op_dtype_prop'}, {'test_file': 'inductor/test_online_softmax'}, {'test_file': 'inductor/test_memory'}, {'test_file': 'dynamo/test_streams'}, {'test_file': 'inductor/test_unbacked_symints'}, {'test_file': 'inductor/test_scatter_optimization'}, {'test_file': 'inductor/test_mix_order_reduction'}, {'test_file': 'inductor/test_padding'}, {'test_file': 'dynamo/test_aot_compile'}, {'test_file': 'dynamo/test_sets'}, {'test_file': 'dynamo/test_wrap_inductor_compiled_regions'}, {'test_file': 'dynamo/test_callback'}, {'test_file': 'dynamo/test_compiler_bisector'}, {'test_file': 
'inductor/test_custom_op_autotune'}, {'test_file': 'inductor/test_cudagraph_trees_expandable_segments'}, {'test_file': 'dynamo/test_decorators'}, {'test_file': 'test_privateuseone_python_backend'}, {'test_file': 'inductor/test_collective_autotuning'}, {'test_file': 'test_varlen_attention'}, {'test_file': 'test_cuda'}, {'test_file': 'test_transformers'}, {'test_file': 'test_autograd'}, {'test_file': 'test_sparse'}, {'test_file': 'higher_order_ops/test_local_map'}, {'test_file': 'test_dataloader'}, {'test_file': 'higher_order_ops/test_invoke_subgraph'}, {'test_file': 'test_ci_sanity_check_fail'}, {'test_file': 'test_ops_fwd_gradients'}, {'test_file': 'test_ops_gradients'}, {'test_file': 'test_nestedtensor'}, {'test_file': 'test_linalg'}, {'test_file': 'test_cuda_expandable_segments'}, {'test_file': 'test_public_bindings'}, {'test_file': 'functorch/test_dims'}, {'test_file': 'test_sparse_csr'}, {'test_file': 'functorch/test_ops'}, {'test_file': 'functorch/test_vmap'}, {'test_file': 'test_overrides'}, {'test_file': 'test_torchfuzz_repros'}, {'test_file': 'inductor/test_group_batch_fusion'}, {'test_file': 'dynamo/test_dynamic_shapes'}, {'test_file': 'inductor/test_cpu_repro'}, {'test_file': 'dynamo/test_after_aot'}, {'test_file': 'inductor/test_snode_runtime'}, {'test_file': 'inductor/test_minifier'}, {'test_file': 'inductor/test_compiled_autograd'}, {'test_file': 'inductor/test_custom_lowering'}, {'test_file': 'inductor/test_perf'}, {'test_file': 'inductor/test_fused_attention'}, {'test_file': 'inductor/test_binary_folding'}, {'test_file': 'inductor/test_mkldnn_pattern_matcher'}, {'test_file': 'inductor/test_inductor_freezing'}, {'test_file': 'inductor/test_layout_optim'}, {'test_file': 'dynamo/test_unspec'}, {'test_file': 'dynamo/test_higher_order_ops'}, {'test_file': 'inductor/test_mmdecomp'}, {'test_file': 'dynamo/test_ctx_manager'}, {'test_file': 'dynamo/test_exc'}, {'test_file': 'dynamo/test_misc'}, {'test_file': 'inductor/test_cpu_select_algorithm'}, 
{'test_file': 'inductor/test_aot_inductor_arrayref'}, {'test_file': 'inductor/test_cpu_cpp_wrapper'}, {'test_file': 'inductor/test_triton_cpu_backend'}, {'test_file': 'inductor/test_torchinductor_strided_blocks'}, {'test_file': 'test_custom_ops'}, {'test_file': 'test_content_store'}, {'test_file': 'inductor/test_halide'}, {'test_file': 'inductor/test_multi_kernel'}, {'test_file': 'inductor/test_analysis'}, {'test_file': 'inductor/test_pad_mm'}, {'test_file': 'inductor/test_triton_syntax'}, {'test_file': 'inductor/test_triton_extension_backend'}, {'test_file': 'test_sparse_semi_structured'}, {'test_file': 'inductor/test_op_completeness'}, {'test_file': 'inductor/test_subgraph_choice'}, {'test_file': 'inductor/test_b2b_gemm'}, {'test_file': 'inductor/test_triton_heuristics'}, {'test_file': 'inductor/test_cutedsl_grouped_mm'}, {'test_file': 'inductor/test_cpp_wrapper_hipify'}, {'test_file': 'inductor/test_ck_backend'}, {'test_file': 'inductor/test_inductor_utils'}, {'test_file': 'inductor/test_template_heuristics_registry'}, {'test_file': 'inductor/test_async_compile'}, {'test_file': 'inductor/test_gpu_cpp_wrapper'}, {'test_file': 'export/test_export_training_ir_to_run_decomp'}, {'test_file': 'dynamo/test_deque_reconstruct'}, {'test_file': 'inductor/test_utils'}, {'test_file': 'inductor/test_indexing'}, {'test_file': 'inductor/test_inductor_annotations'}, {'test_file': 'inductor/test_compile_worker'}, {'test_file': 'dynamo/test_einops'}, {'test_file': 'inductor/test_external_callables'}, {'test_file': 'test_testing'}, {'test_file': 'dynamo/test_fx_passes_pre_grad'}, {'test_file': 'inductor/test_autoheuristic'}, {'test_file': 'export/test_strict_export_v2'}, {'test_file': 'inductor/test_flex_flash'}, {'test_file': 'inductor/test_segmented_tree'}, {'test_file': 'inductor/test_kernel_optimization'}, {'test_file': 'inductor/test_metrics'}, {'test_file': 'export/test_unflatten_training_ir'}, {'test_file': 'inductor/test_fx_fusion'}, {'test_file': 
'inductor/test_xpu_basic'}, {'test_file': 'dynamo/test_inline_and_install'}, {'test_file': 'export/test_functionalized_assertions'}, {'test_file': 'inductor/test_selective_lowering'}, {'test_file': 'dynamo/test_base_output'}, {'test_file': 'inductor/test_lookup_table'}, {'test_file': 'inductor/test_cooperative_reductions'}, {'test_file': 'export/test_serialize'}, {'test_file': 'inductor/test_cutedsl_template'}, {'test_file': 'inductor/test_benchmark_fusion'}, {'test_file': 'inductor/test_inductor_scheduler'}, {'test_file': 'inductor/test_move_constructors_to_gpu'}, {'test_file': 'export/test_export_strict'}, {'test_file': 'dynamo/test_modules'}, {'test_file': 'inductor/test_remote_cache'}, {'test_file': 'inductor/test_coordinate_descent_tuner'}, {'test_file': 'inductor/test_inplace_padding'}, {'test_file': 'inductor/test_cudacodecache'}, {'test_file': 'inductor/test_minifier_utils'}, {'test_file': 'inductor/test_debug_trace'}, {'test_file': 'dynamo/test_recompiles'}, {'test_file': 'inductor/test_foreach'}, {'test_file': 'export/test_tree_utils'}, {'test_file': 'inductor/test_triton_wrapper'}, {'test_file': 'inductor/test_static_cuda_launcher'}, {'test_file': 'export/test_dynamic_shapes'}, {'test_file': 'dynamo/test_sdpa'}, {'test_file': 'dynamo/test_utils'}, {'test_file': 'inductor/test_provenance_tracing'}, {'test_file': 'inductor/test_combo_kernels'}, {'test_file': 'inductor/test_codegen_triton'}, {'test_file': 'dynamo/test_frame_init'}, {'test_file': 'inductor/test_device_assert'}, {'test_file': 'dynamo/test_skip_non_tensor'}, {'test_file': 'dynamo/test_skip_guard_eval_unsafe'}, {'test_file': 'dynamo/test_interop'}, {'test_file': 'functorch/test_eager_transforms'}, {'test_file': 'inductor/test_control_deps'}, {'test_file': 'inductor/test_benchmarking'}, {'test_file': 'inductor/test_helion_kernels'}, {'test_file': 'inductor/test_quantization'}, {'test_file': 'inductor/test_best_config'}, {'test_file': 'export/test_tools'}, {'test_file': 
'dynamo/test_buffers_override'}, {'test_file': 'inductor/test_inplacing_pass'}, {'test_file': 'inductor/test_aot_inductor_custom_ops'}, {'test_file': 'inductor/test_split_cat_fx_passes'}, {'test_file': 'inductor/test_profiler'}, {'test_file': 'inductor/test_memory_planning'}, {'test_file': 'inductor/test_mem_estimation'}, {'test_file': 'dynamo/test_view'}, {'test_file': 'inductor/test_cutlass_evt'}, {'test_file': 'dynamo/test_reconstruct'}, {'test_file': 'dynamo/test_aot_autograd'}, {'test_file': 'export/test_cpp_serdes'}, {'test_file': 'inductor/test_cache'}, {'test_file': 'inductor/test_block_analysis'}, {'test_file': 'dynamo/test_subgraphs'}, {'test_file': 'dynamo/test_pre_dispatch'}, {'test_file': 'inductor/test_custom_post_grad_passes'}, {'test_file': 'dynamo/test_fx_annotate'}, {'test_file': 'dynamo/test_pgo'}, {'test_file': 'dynamo/test_config'}, {'test_file': 'dynamo/test_metrics_context'}, {'test_file': 'export/test_package'}, {'test_file': 'export/test_export_opinfo'}, {'test_file': 'dynamo/test_nops'}, {'test_file': 'inductor/test_graph_transform_observer'}, {'test_file': 'inductor/test_aot_inductor_utils'}, {'test_file': 'export/test_db'}, {'test_file': 'dynamo/test_export_mutations'}, {'test_file': 'inductor/test_config'}, {'test_file': 'inductor/test_dependencies'}, {'test_file': 'inductor/test_fuzzer'}, {'test_file': 'dynamo/test_global'}, {'test_file': 'inductor/test_control_flow'}, {'test_file': 'dynamo/test_graph_region_tracker'}, {'test_file': 'dynamo/test_unittest'}, {'test_file': 'inductor/test_compile'}, {'test_file': 'dynamo/test_functions'}, {'test_file': 'inductor/test_ordered_set'}, {'test_file': 'inductor/test_pallas'}, {'test_file': 'dynamo/test_install_free_tensors'}, {'test_file': 'inductor/test_torchinductor_codegen_config_overrides'}, {'test_file': 'export/test_passes'}, {'test_file': 'dynamo/test_autograd_function'}, {'test_file': 'inductor/test_codecache'}, {'test_file': 'dynamo/test_cudagraphs'}, {'test_file': 
'inductor/test_alignment'}, {'test_file': 'dynamo/test_profiler'}, {'test_file': 'dynamo/test_guard_serialization'}, {'test_file': 'dynamo/test_compile'}, {'test_file': 'dynamo/test_nested_graph_breaks'}, {'test_file': 'dynamo/test_dicts'}, {'test_file': 'inductor/test_needs_exact_strides'}, {'test_file': 'inductor/test_auto_functionalize'}, {'test_file': 'inductor/test_split_cat_fx_aten_passes'}, {'test_file': 'inductor/test_minifier_isolate'}, {'test_file': 'dynamo/test_list'}, {'test_file': 'dynamo/test_resume'}, {'test_file': 'inductor/test_augmented_graph_helper'}, {'test_file': 'dynamo/test_deviceguard'}, {'test_file': 'dynamo/test_sources'}, {'test_file': 'dynamo/test_backward_higher_order_ops'}, {'test_file': 'dynamo/test_modes'}, {'test_file': 'dynamo/test_optimizers'}, {'test_file': 'export/test_torchbind'}, {'test_file': 'inductor/test_custom_partitioner_fn'}, {'test_file': 'dynamo/test_debug_utils'}, {'test_file': 'dynamo/test_base_hop'}, {'test_file': 'dynamo/test_export'}, {'test_file': 'dynamo/test_package'}, {'test_file': 'inductor/test_efficient_conv_bn_eval'}, {'test_file': 'inductor/test_torchbind'}, {'test_file': 'dynamo/test_python_dispatcher'}, {'test_file': 'export/test_swap'}, {'test_file': 'export/test_unflatten'}, {'test_file': 'dynamo/test_verify_correctness'}, {'test_file': 'inductor/test_fxir_backend'}, {'test_file': 'dynamo/test_cudagraphs_expandable_segments'}, {'test_file': 'inductor/test_caching'}, {'test_file': 'dynamo/test_aot_autograd_cache'}, {'test_file': 'dynamo/test_flat_apply'}, {'test_file': 'dynamo/test_input_attr_tracking'}, {'test_file': 'dynamo/test_graph_deduplication'}, {'test_file': 'inductor/test_distributed_patterns'}, {'test_file': 'dynamo/test_structured_trace'}, {'test_file': 'dynamo/test_error_messages'}, {'test_file': 'dynamo/test_bytecode_utils'}, {'test_file': 'dynamo/test_fake_distributed'}, {'test_file': 'inductor/test_mps_basic'}, {'test_file': 'export/test_nativert'}, {'test_file': 'export/test_hop'}, 
{'test_file': 'dynamo/test_tree_map'}, {'test_file': 'dynamo/test_minifier'}, {'test_file': 'dynamo/test_guard_manager'}, {'test_file': 'export/test_schema'}, {'test_file': 'dynamo/test_torchrec'}, {'test_file': 'export/test_pass_infra'}, {'test_file': 'export/test_experimental'}, {'test_file': 'export/test_converter'}, {'test_file': 'export/test_export'}, {'test_file': 'test_model_exports_to_core_aten'}, {'test_file': 'dynamo/test_precompile_context'}, {'test_file': 'dynamo/test_trace_rules'}, {'test_file': 'export/test_upgrader'}, {'test_file': 'dynamo/test_hooks'}, {'test_file': 'dynamo/test_reorder_logs'}, {'test_file': 'dynamo/test_subclasses'}, {'test_file': 'dynamo/test_exceptions'}, {'test_file': 'dynamo/test_generator'}, {'test_file': 'export/test_lift_unlift'}, {'test_file': 'export/test_verifier'}, {'test_file': 'export/test_sparse'}, {'test_file': 'dynamo/test_python_autograd'}, {'test_file': 'export/test_draft_export'}, {'test_file': 'dynamo/test_comptime'}, {'test_file': 'test_sort_and_select'}, {'test_file': 'functorch/test_rearrange'}, {'test_file': 'functorch/test_parsing'}, {'test_file': 'test_package'}, {'test_file': 'profiler/test_profiler'}, {'test_file': 'test_mkl_verbose'}, {'test_file': 'test_comparison_utils'}, {'test_file': 'functorch/test_ac_logging'}, {'test_file': 'test_mkldnn_verbose'}, {'test_file': 'test_cpp_api_parity'}, {'test_file': 'test_utils_config_module'}, {'test_file': 'test_hop_infra'}, {'test_file': 'test_appending_byte_serializer'}, {'test_file': 'test_license'}, {'test_file': 'test_ao_sparsity'}, {'test_file': 'test_autoload'}, {'test_file': 'nn/attention/test_open_registry'}, {'test_file': 'xpu/test_fusion'}, {'test_file': 'test_as_strided'}, {'test_file': 'test_foreach'}, {'test_file': 'test_proxy_tensor'}, {'test_file': 'torch_np/test_binary_ufuncs'}, {'test_file': 'torch_np/test_unary_ufuncs'}, {'test_file': 'test_utils_filelock'}, {'test_file': 'test_extension_utils'}, {'test_file': 
'test_rename_privateuse1_to_existing_device'}, {'test_file': 'nn/attention/test_fa4'}, {'test_file': 'typing/test_python_operators'}, {'test_file': 'test_functionalization'}, {'test_file': 'torch_np/test_dtype'}, {'test_file': 'test_file_check'}, {'test_file': 'profiler/test_kineto'}, {'test_file': 'test_flop_counter'}, {'test_file': 'backends/xeon/test_launch'}, {'test_file': 'test_show_pickle'}, {'test_file': 'test_openmp'}, {'test_file': 'test_expanded_weights'}, {'test_file': 'test_module_tracker'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarinherit'}, {'test_file': 'test_tensorexpr_pybind'}, {'test_file': 'test_fx_experimental'}, {'test_file': 'functorch/test_ac_knapsack'}, {'test_file': 'torch_np/test_nep50_examples'}, {'test_file': 'test_torch'}, {'test_file': 'xpu/test_gemm'}, {'test_file': 'test_fx_passes'}, {'test_file': 'functorch/test_logging'}, {'test_file': 'test_namedtensor'}, {'test_file': 'test_tensorexpr'}, {'test_file': 'functorch/test_minifier'}, {'test_file': 'higher_order_ops/test_invoke_quant'}, {'test_file': 'torch_np/test_basic'}, {'test_file': 'test_jiterator'}, {'test_file': 'test_native_functions'}, {'test_file': 'test_typing'}, {'test_file': 'higher_order_ops/test_with_effects'}, {'test_file': 'test_weak'}, {'test_file': 'test_complex'}, {'test_file': 'test_optim'}, {'test_file': 'lazy/test_functionalization'}, {'test_file': 'torch_np/test_random'}, {'test_file': 'nn/test_multihead_attention'}, {'test_file': 'test_legacy_vmap'}, {'test_file': 'lazy/test_bindings'}, {'test_file': 'xpu/test_conv'}, {'test_file': 'test_utils'}, {'test_file': 'test_pytree'}, {'test_file': 'test_namedtuple_return_api'}, {'test_file': 'profiler/test_record_function'}, {'test_file': 'test_compile_benchmark_util'}, {'test_file': 'test_set_default_mobile_cpu_allocator'}, {'test_file': 'test_fake_tensor'}, {'test_file': 'test_stateless'}, {'test_file': 'functorch/test_ac'}, {'test_file': 'test_binary_ufuncs'}, {'test_file': 
'higher_order_ops/test_print'}, {'test_file': 'test_per_overload_api'}, {'test_file': 'torch_np/numpy_tests/core/test_einsum'}, {'test_file': 'test_multiprocessing'}, {'test_file': 'test_out_dtype_op'}, {'test_file': 'torch_np/test_ufuncs_basic'}, {'test_file': 'lazy/test_step_closures'}, {'test_file': 'functorch/dim/test_getsetitem'}, {'test_file': 'test_numpy_interop'}, {'test_file': 'profiler/test_cpp_thread'}, {'test_file': 'test_segment_reductions'}, {'test_file': 'test_opaque_obj_v2'}, {'test_file': 'test_autograd_fallback'}, {'test_file': 'test_type_hints'}, {'test_file': 'functorch/test_aot_joint_with_descriptors'}, {'test_file': 'test_functionalization_of_rng_ops'}, {'test_file': 'test_fx_reinplace_pass'}, {'test_file': 'functorch/test_control_flow'}, {'test_file': 'test_modules'}, {'test_file': 'nn/test_packed_sequence'}, {'test_file': 'test_numa_binding'}, {'test_file': 'test_pruning_op'}, {'test_file': 'test_jit_fuser_te'}, {'test_file': 'test_autocast'}, {'test_file': 'test_logging'}, {'test_file': 'test_python_dispatch'}, {'test_file': 'nn/test_lazy_modules'}, {'test_file': 'nn/test_pruning'}, {'test_file': 'test_monitor'}, {'test_file': 'test_cuda_sanitizer'}, {'test_file': 'test_bundled_inputs'}, {'test_file': 'torch_np/numpy_tests/core/test_numeric'}, {'test_file': 'torch_np/numpy_tests/core/test_multiarray'}, {'test_file': 'test_itt'}, {'test_file': 'torch_np/numpy_tests/lib/test_function_base'}, {'test_file': 'test_masked'}, {'test_file': 'test_sympy_utils'}, {'test_file': 'test_jit_disabled'}, {'test_file': 'test_subclass'}, {'test_file': 'test_import_stats'}, {'test_file': 'functorch/test_vmap_registrations'}, {'test_file': 'nn/test_parametrization'}, {'test_file': 'test_mkldnn_fusion'}, {'test_file': 'test_cpp_extensions_mtia_backend'}, {'test_file': 'lazy/test_ts_opinfo'}, {'test_file': 'test_dynamic_shapes'}, {'test_file': 'complex_tensor/test_complex_tensor'}, {'test_file': 'optim/test_lrscheduler'}, {'test_file': 'optim/test_swa_utils'}, 
{'test_file': 'cpp_extensions/python_agnostic_extension/test/test_python_agnostic'}, {'test_file': 'functorch/test_memory_efficient_fusion'}, {'test_file': 'torch_np/numpy_tests/lib/test_histograms'}, {'test_file': 'torch_np/test_indexing'}, {'test_file': 'test_schema_check'}, {'test_file': 'test_tensorboard'}, {'test_file': 'torch_np/numpy_tests/core/test_indexing'}, {'test_file': 'test_futures'}, {'test_file': 'test_tensor_creation_ops'}, {'test_file': 'nn/test_dropout'}, {'test_file': 'functorch/dim/test_split'}, {'test_file': 'torch_np/numpy_tests/lib/test_type_check'}, {'test_file': 'cpp_extensions/test_libtorch_agnostic'}, {'test_file': 'test_cpp_extensions_stream_and_event'}, {'test_file': 'profiler/test_execution_trace'}, {'test_file': 'test_dispatch'}, {'test_file': 'test_datapipe'}, {'test_file': 'test_numba_integration'}, {'test_file': 'test_functional_optim'}, {'test_file': 'test_maskedtensor'}, {'test_file': 'benchmark_utils/test_benchmark_utils'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarmath'}, {'test_file': 'test_scaled_matmul_cuda'}, {'test_file': 'torch_np/numpy_tests/core/test_shape_base'}, {'test_file': 'test_vulkan'}, {'test_file': 'lazy/test_generator'}, {'test_file': 'nn/test_convolution'}, {'test_file': 'torch_np/numpy_tests/linalg/test_linalg'}, {'test_file': 'torch_np/numpy_tests/core/test_dtype'}, {'test_file': 'lazy/test_debug_util'}, {'test_file': 'nn/test_load_state_dict'}, {'test_file': 'test_shape_ops'}, {'test_file': 'nn/test_module_hooks'}, {'test_file': 'torch_np/numpy_tests/lib/test_twodim_base'}, {'test_file': 'profiler/test_memory_profiler'}, {'test_file': 'test_jit_llga_fuser'}, {'test_file': 'test_serialization'}, {'test_file': 'optim/test_optim'}, {'test_file': 'test_indexing'}, {'test_file': 'torch_np/numpy_tests/fft/test_pocketfft'}, {'test_file': 'torch_np/numpy_tests/lib/test_shape_base_'}, {'test_file': 'torch_np/numpy_tests/core/test_getlimits'}, {'test_file': 'torch_np/test_ndarray_methods'}, {'test_file': 
'test_view_ops'}, {'test_file': 'test_type_info'}, {'test_file': 'functorch/test_aotdispatch'}, {'test_file': 'test_nn'}, {'test_file': 'torch_np/numpy_tests/core/test_dlpack'}, {'test_file': 'test_multiprocessing_spawn'}, {'test_file': 'test_scatter_gather_ops'}, {'test_file': 'test_cuda_multigpu'}, {'test_file': 'test_mkldnn'}, {'test_file': 'torch_np/numpy_tests/lib/test_index_tricks'}, {'test_file': 'test_jit_autocast'}, {'test_file': 'nn/test_pooling'}, {'test_file': 'nn/test_embedding'}, {'test_file': 'test_unary_ufuncs'}, {'test_file': 'test_xnnpack_integration'}, {'test_file': 'test_cuda_trace'}, {'test_file': 'test_native_mha'}, {'test_file': 'torch_np/numpy_tests/core/test_numerictypes'}, {'test_file': 'test_cuda_nvml_based_avail'}, {'test_file': 'test_function_schema'}, {'test_file': 'test_accelerator'}, {'test_file': 'nn/test_init'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_methods'}, {'test_file': 'torch_np/numpy_tests/fft/test_helper'}, {'test_file': 'test_mobile_optimizer'}, {'test_file': 'torch_np/test_function_base'}, {'test_file': 'test_type_promotion'}, {'test_file': 'torch_np/test_scalars_0D_arrays'}, {'test_file': 'test_cuda_primary_ctx'}, {'test_file': 'profiler/test_profiler_tree'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraysetops'}, {'test_file': 'test_dlpack'}, {'test_file': 'profiler/test_torch_tidy'}, {'test_file': 'lazy/test_reuse_ir'}, {'test_file': 'test_functional_autograd_benchmark'}, {'test_file': 'test_reductions'}, {'test_file': 'torch_np/test_reductions'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_ctors'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraypad'}, {'test_file': 'test_prims'}, {'test_file': 'test_spectral_ops'}, {'test_file': 'profiler/test_python_tracer'}, {'test_file': 'cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility'}, {'test_file': 'distributions/test_distributions'}, {'test_file': 'test_autoload_disable'}, {'test_file': 'test_autoload_enable'}, 
{'test_file': 'test_cpp_extensions_aot_ninja'}, {'test_file': 'test_cpp_extensions_aot_no_ninja'}], 'excluded': []} from test/test-reports/td_exclusions-ea2e7c72298c8a362420.json is not a benchmark record, skipping 2025-12-04T11:10:45.3883144Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T11:10:45.3982356Z ##[group]Run cat test/**/*_toprint.log || true 2025-12-04T11:10:45.3982739Z cat test/**/*_toprint.log || true 2025-12-04T11:10:45.3991889Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:45.3992250Z env: 2025-12-04T11:10:45.3992465Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:45.3992723Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:45.3993025Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:45.3993583Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:45.3994172Z DEVICE_NAME: 2025-12-04T11:10:45.3994448Z DEVICE_TYPE: 2025-12-04T11:10:45.3994760Z ##[endgroup] 2025-12-04T11:10:45.4110240Z cat: 'test/**/*_toprint.log': No such file or directory 2025-12-04T11:10:45.4141576Z ##[group]Run kill "$MONITOR_SCRIPT_PID" 2025-12-04T11:10:45.4141928Z kill "$MONITOR_SCRIPT_PID" 2025-12-04T11:10:45.4150314Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:45.4150669Z env: 2025-12-04T11:10:45.4150876Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:45.4151133Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:45.4151426Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:45.4151951Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:45.4152413Z DEVICE_NAME: 2025-12-04T11:10:45.4152631Z DEVICE_TYPE: 2025-12-04T11:10:45.4152853Z MONITOR_SCRIPT_PID: 59428 2025-12-04T11:10:45.4153111Z ##[endgroup] 2025-12-04T11:10:45.4182480Z /home/ec2-user/actions-runner/_work/_temp/b2bb3ba8-615b-4f5e-9fbc-39d3a4576c83.sh: line 1: kill: (59428) - No such process 
2025-12-04T11:10:45.4193263Z ##[error]Process completed with exit code 1. 2025-12-04T11:10:45.4323141Z Prepare all required actions 2025-12-04T11:10:45.4323533Z Getting action download info 2025-12-04T11:10:45.6276544Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T11:10:46.2936934Z Download action repository 'actions/upload-artifact@v4' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-12-04T11:10:48.3775163Z ##[group]Run ./.github/actions/upload-test-artifacts 2025-12-04T11:10:48.3775504Z with: 2025-12-04T11:10:48.3775850Z file-suffix: test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T11:10:48.3776279Z s3-bucket: gha-artifacts 2025-12-04T11:10:48.3776521Z env: 2025-12-04T11:10:48.3776717Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.3776961Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.3777265Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:48.3777773Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.3778264Z DEVICE_NAME: 2025-12-04T11:10:48.3778481Z DEVICE_TYPE: 2025-12-04T11:10:48.3778694Z ##[endgroup] 2025-12-04T11:10:48.3838864Z ##[group]Run # Remove any previous test jsons if they exist 2025-12-04T11:10:48.3839301Z # Remove any previous test jsons if they exist 2025-12-04T11:10:48.3839735Z rm -f test-jsons-*.zip 2025-12-04T11:10:48.3840143Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test/test-reports -i '*.json' 2025-12-04T11:10:48.3849406Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:48.3849764Z env: 2025-12-04T11:10:48.3849978Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.3850243Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.3850545Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:48.3851083Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.3851558Z DEVICE_NAME: 
2025-12-04T11:10:48.3851774Z DEVICE_TYPE: 2025-12-04T11:10:48.3852136Z FILE_SUFFIX: test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T11:10:48.3852541Z ##[endgroup] 2025-12-04T11:10:48.4592042Z adding: test/test-reports/td_exclusions-ea2e7c72298c8a362420.json (deflated 82%) 2025-12-04T11:10:48.4619444Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-33c1f9cf025a3215.json (deflated 99%) 2025-12-04T11:10:48.4620642Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_dynamic_shapes/inductor.test_torchinductor_codegen_dynamic_shapes-714bf3905702b92c.json (stored 0%) 2025-12-04T11:10:48.4621911Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-0945b0cce338d2d9.json (deflated 98%) 2025-12-04T11:10:48.5837706Z adding: test/test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-c842e470cbb98a3c.json (deflated 96%) 2025-12-04T11:10:48.5843880Z adding: test/test-reports/python-pytest/inductor.test_cuda_repro/inductor.test_cuda_repro-e99e0f6eb81b07f7.json (deflated 99%) 2025-12-04T11:10:48.5844953Z adding: test/test-reports/python-pytest/dynamo.test_activation_checkpointing/dynamo.test_activation_checkpointing-38da9e9ed6d54bce.json (stored 0%) 2025-12-04T11:10:48.5845989Z adding: test/test-reports/python-pytest/dynamo.test_logging/dynamo.test_logging-f357a72c7cdbf6f8.json (stored 0%) 2025-12-04T11:10:48.5938245Z adding: test/test-reports/python-pytest/dynamo.test_repros/dynamo.test_repros-9531739f45e08308.json (deflated 99%) 2025-12-04T11:10:48.5939218Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-b253e91c6bffe97f.json (stored 0%) 2025-12-04T11:10:48.5940237Z adding: test/test-reports/python-pytest/inductor.test_flex_decoding/inductor.test_flex_decoding-718ee71e5383cf4c.json (stored 0%) 2025-12-04T11:10:48.5996997Z adding: 
test/test-reports/python-pytest/dynamo.test_fx_graph_runnable/dynamo.test_fx_graph_runnable-51563ef4cc34da7e.json (deflated 99%) 2025-12-04T11:10:48.5999080Z adding: test/test-reports/python-pytest/inductor.test_online_softmax/inductor.test_online_softmax-f797910239038b77.json (stored 0%) 2025-12-04T11:10:48.6001349Z adding: test/test-reports/python-pytest/inductor.test_memory/inductor.test_memory-380c528ed363230b.json (stored 0%) 2025-12-04T11:10:48.6003102Z adding: test/test-reports/python-pytest/dynamo.test_streams/dynamo.test_streams-8a8793e6ce17b0e2.json (stored 0%) 2025-12-04T11:10:48.6004586Z adding: test/test-reports/python-pytest/inductor.test_unbacked_symints/inductor.test_unbacked_symints-dfbf01aaa57bc123.json (stored 0%) 2025-12-04T11:10:48.6005568Z adding: test/test-reports/python-pytest/dynamo.test_aot_compile/dynamo.test_aot_compile-f9d7f69515583290.json (stored 0%) 2025-12-04T11:10:48.6006575Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-7cc4530ad1782b86.json (stored 0%) 2025-12-04T11:10:48.6007576Z adding: test/test-reports/python-pytest/test_varlen_attention/test_varlen_attention-ad0d22c525c989e0.json (stored 0%) 2025-12-04T11:10:48.6008412Z adding: test/test-reports/python-pytest/test_autograd/test_autograd-024c4c739b0a1a21.json (deflated 99%) 2025-12-04T11:10:48.6009272Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-b1f2319d42b89977.json (deflated 98%) 2025-12-04T11:10:48.6010152Z adding: test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-835aa3d27e77a0f6.json (stored 0%) 2025-12-04T11:10:48.6011087Z adding: test/test-reports/python-pytest/test_nestedtensor/test_nestedtensor-295bc14f77bb0dc7.json (stored 0%) 2025-12-04T11:10:48.6011916Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-0efbb8a89000c5fe.json (deflated 98%) 2025-12-04T11:10:48.6014298Z adding: 
test/test-reports/python-pytest/test_overrides/test_overrides-1eb0f1d69a632627.json (deflated 99%) 2025-12-04T11:10:48.6015208Z adding: test/test-reports/python-pytest/test_torchfuzz_repros/test_torchfuzz_repros-1a411881b6ff90d5.json (stored 0%) 2025-12-04T11:10:48.6016188Z adding: test/test-reports/python-pytest/inductor.test_group_batch_fusion/inductor.test_group_batch_fusion-4467bcde0d834301.json (stored 0%) 2025-12-04T11:10:48.6017209Z adding: test/test-reports/python-pytest/dynamo.test_dynamic_shapes/dynamo.test_dynamic_shapes-479e026d710782d7.json (stored 0%) 2025-12-04T11:10:48.6018208Z adding: test/test-reports/python-pytest/inductor.test_custom_lowering/inductor.test_custom_lowering-5db5aae03b627061.json (stored 0%) 2025-12-04T11:10:48.6019168Z adding: test/test-reports/python-pytest/inductor.test_perf/inductor.test_perf-cdb6dc2507d07b9f.json (stored 0%) 2025-12-04T11:10:48.6064379Z adding: test/test-reports/python-pytest/inductor.test_mkldnn_pattern_matcher/inductor.test_mkldnn_pattern_matcher-d34f2a11a4dc09d6.json (deflated 99%) 2025-12-04T11:10:48.6066801Z adding: test/test-reports/python-pytest/dynamo.test_deque_reconstruct/dynamo.test_deque_reconstruct-a848d884f2f29080.json (stored 0%) 2025-12-04T11:10:48.6068711Z adding: test/test-reports/python-pytest/inductor.test_utils/inductor.test_utils-bd079476913ef2e3.json (stored 0%) 2025-12-04T11:10:48.6070503Z adding: test/test-reports/python-pytest/inductor.test_indexing/inductor.test_indexing-fee711b29eff954c.json (stored 0%) 2025-12-04T11:10:48.6072533Z adding: test/test-reports/python-pytest/inductor.test_inductor_annotations/inductor.test_inductor_annotations-f4480b8427e264b4.json (stored 0%) 2025-12-04T11:10:48.6074482Z adding: test/test-reports/python-pytest/inductor.test_compile_worker/inductor.test_compile_worker-3d3c1537308173d9.json (stored 0%) 2025-12-04T11:10:48.6075464Z adding: test/test-reports/python-pytest/export.test_serialize/export.test_serialize-76457f9e334c8675.json (stored 0%) 
2025-12-04T11:10:48.6076392Z adding: test/test-reports/python-pytest/export.test_export_strict/export.test_export_strict-3ce7562e53da487b.json (stored 0%) 2025-12-04T11:10:48.6077528Z adding: test/test-reports/python-pytest/dynamo.test_buffers_override/dynamo.test_buffers_override-bcc16daa85498c4b.json (stored 0%) 2025-12-04T11:10:48.6078578Z adding: test/test-reports/python-pytest/inductor.test_split_cat_fx_passes/inductor.test_split_cat_fx_passes-02502e259211870e.json (stored 0%) 2025-12-04T11:10:48.6079547Z adding: test/test-reports/python-pytest/inductor.test_cache/inductor.test_cache-258ddb4609e595e9.json (stored 0%) 2025-12-04T11:10:48.6080593Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-fd83be05005cff21.json (stored 0%) 2025-12-04T11:10:48.6081597Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-9924173763b05f6a.json (stored 0%) 2025-12-04T11:10:48.6082511Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-9bd31207a046932a.json (stored 0%) 2025-12-04T11:10:48.6083334Z adding: test/test-reports/python-pytest/test_foreach/test_foreach-57977e6bb7997bc3.json (deflated 98%) 2025-12-04T11:10:48.6084177Z adding: test/test-reports/python-pytest/nn.test_packed_sequence/nn.test_packed_sequence-dcd68b3499aab515.json (stored 0%) 2025-12-04T11:10:48.6085109Z adding: test/test-reports/python-pytest/test_numa_binding/test_numa_binding-62dfaebcfc08487c.json (stored 0%) 2025-12-04T11:10:48.6085925Z adding: test/test-reports/python-pytest/test_pruning_op/test_pruning_op-d12712c8cd1c0003.json (stored 0%) 2025-12-04T11:10:48.6086779Z adding: test/test-reports/python-pytest/test_jit_fuser_te/test_jit_fuser_te-0670cf00619133fb.json (deflated 99%) 2025-12-04T11:10:48.6087990Z adding: test/test-reports/python-pytest/torch_np.numpy_tests.core.test_indexing/torch_np.numpy_tests.core.test_indexing-f6308bf864aa318d.json (stored 0%) 
2025-12-04T11:10:48.6088969Z adding: test/test-reports/python-pytest/test_futures/test_futures-eeccb63775c4add0.json (stored 0%) 2025-12-04T11:10:48.6089830Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-d21bc49881a2e2b0.json (stored 0%) 2025-12-04T11:10:48.6090773Z adding: test/test-reports/python-pytest/test_scaled_matmul_cuda/test_scaled_matmul_cuda-d9bde1d5292f9755.json (deflated 98%) 2025-12-04T11:10:48.6091837Z adding: test/test-reports/python-pytest/torch_np.numpy_tests.core.test_shape_base/torch_np.numpy_tests.core.test_shape_base-5523c0bbcadda606.json (stored 0%) 2025-12-04T11:10:48.6092793Z adding: test/test-reports/python-pytest/test_vulkan/test_vulkan-b9421a80edb6f140.json (stored 0%) 2025-12-04T11:10:48.6093605Z adding: test/test-reports/python-pytest/lazy.test_generator/lazy.test_generator-6425516431ce47a1.json (stored 0%) 2025-12-04T11:10:48.6094529Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-b82571327b265a63.json (stored 0%) 2025-12-04T11:10:48.6121532Z ##[group]Run # Remove any previous test reports if they exist 2025-12-04T11:10:48.6122060Z # Remove any previous test reports if they exist 2025-12-04T11:10:48.6122444Z rm -f test-reports-*.zip 2025-12-04T11:10:48.6122898Z zip -r "test-reports-${FILE_SUFFIX}.zip" test/test-reports -i '*.xml' -i '*.csv' 2025-12-04T11:10:48.6132085Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:48.6132453Z env: 2025-12-04T11:10:48.6132679Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.6132950Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.6133261Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:48.6133786Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.6134256Z DEVICE_NAME: 2025-12-04T11:10:48.6134470Z DEVICE_TYPE: 2025-12-04T11:10:48.6134871Z FILE_SUFFIX: test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 
2025-12-04T11:10:48.6135300Z ##[endgroup] 2025-12-04T11:10:48.6303599Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-33c1f9cf025a3215.xml (deflated 99%) 2025-12-04T11:10:48.6305456Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_dynamic_shapes/inductor.test_torchinductor_codegen_dynamic_shapes-714bf3905702b92c.xml (deflated 28%) 2025-12-04T11:10:48.6306898Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-0945b0cce338d2d9.xml (deflated 98%) 2025-12-04T11:10:48.7508914Z adding: test/test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-c842e470cbb98a3c.xml (deflated 96%) 2025-12-04T11:10:48.7513641Z adding: test/test-reports/python-pytest/inductor.test_cuda_repro/inductor.test_cuda_repro-e99e0f6eb81b07f7.xml (deflated 99%) 2025-12-04T11:10:48.7514997Z adding: test/test-reports/python-pytest/dynamo.test_activation_checkpointing/dynamo.test_activation_checkpointing-38da9e9ed6d54bce.xml (deflated 28%) 2025-12-04T11:10:48.7516037Z adding: test/test-reports/python-pytest/dynamo.test_logging/dynamo.test_logging-f357a72c7cdbf6f8.xml (deflated 28%) 2025-12-04T11:10:48.7604443Z adding: test/test-reports/python-pytest/dynamo.test_repros/dynamo.test_repros-9531739f45e08308.xml (deflated 99%) 2025-12-04T11:10:48.7605682Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-b253e91c6bffe97f.xml (deflated 28%) 2025-12-04T11:10:48.7606705Z adding: test/test-reports/python-pytest/inductor.test_flex_decoding/inductor.test_flex_decoding-718ee71e5383cf4c.xml (deflated 28%) 2025-12-04T11:10:48.7659084Z adding: test/test-reports/python-pytest/dynamo.test_fx_graph_runnable/dynamo.test_fx_graph_runnable-51563ef4cc34da7e.xml (deflated 99%) 2025-12-04T11:10:48.7660537Z adding: 
test/test-reports/python-pytest/inductor.test_online_softmax/inductor.test_online_softmax-f797910239038b77.xml (deflated 28%) 2025-12-04T11:10:48.7661844Z adding: test/test-reports/python-pytest/inductor.test_memory/inductor.test_memory-380c528ed363230b.xml (deflated 28%) 2025-12-04T11:10:48.7662745Z adding: test/test-reports/python-pytest/dynamo.test_streams/dynamo.test_streams-8a8793e6ce17b0e2.xml (deflated 28%) 2025-12-04T11:10:48.7663719Z adding: test/test-reports/python-pytest/inductor.test_unbacked_symints/inductor.test_unbacked_symints-dfbf01aaa57bc123.xml (deflated 28%) 2025-12-04T11:10:48.7664741Z adding: test/test-reports/python-pytest/dynamo.test_aot_compile/dynamo.test_aot_compile-f9d7f69515583290.xml (deflated 28%) 2025-12-04T11:10:48.7666188Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-7cc4530ad1782b86.xml (deflated 28%) 2025-12-04T11:10:48.7667532Z adding: test/test-reports/python-pytest/test_varlen_attention/test_varlen_attention-ad0d22c525c989e0.xml (deflated 28%) 2025-12-04T11:10:48.7668550Z adding: test/test-reports/python-pytest/test_autograd/test_autograd-024c4c739b0a1a21.xml (deflated 99%) 2025-12-04T11:10:48.7669855Z adding: test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-b1f2319d42b89977.xml (deflated 98%) 2025-12-04T11:10:48.7670823Z adding: test/test-reports/python-pytest/test_ops_gradients/test_ops_gradients-835aa3d27e77a0f6.xml (deflated 28%) 2025-12-04T11:10:48.7671687Z adding: test/test-reports/python-pytest/test_nestedtensor/test_nestedtensor-295bc14f77bb0dc7.xml (deflated 28%) 2025-12-04T11:10:48.7672510Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-0efbb8a89000c5fe.xml (deflated 98%) 2025-12-04T11:10:48.7674282Z adding: test/test-reports/python-pytest/test_overrides/test_overrides-1eb0f1d69a632627.xml (deflated 99%) 2025-12-04T11:10:48.7675550Z adding: 
test/test-reports/python-pytest/test_torchfuzz_repros/test_torchfuzz_repros-1a411881b6ff90d5.xml (deflated 28%) 2025-12-04T11:10:48.7676887Z adding: test/test-reports/python-pytest/inductor.test_group_batch_fusion/inductor.test_group_batch_fusion-4467bcde0d834301.xml (deflated 28%) 2025-12-04T11:10:48.7677927Z adding: test/test-reports/python-pytest/dynamo.test_dynamic_shapes/dynamo.test_dynamic_shapes-479e026d710782d7.xml (deflated 28%) 2025-12-04T11:10:48.7679116Z adding: test/test-reports/python-pytest/inductor.test_custom_lowering/inductor.test_custom_lowering-5db5aae03b627061.xml (deflated 28%) 2025-12-04T11:10:48.7680204Z adding: test/test-reports/python-pytest/inductor.test_perf/inductor.test_perf-cdb6dc2507d07b9f.xml (deflated 28%) 2025-12-04T11:10:48.7724694Z adding: test/test-reports/python-pytest/inductor.test_mkldnn_pattern_matcher/inductor.test_mkldnn_pattern_matcher-d34f2a11a4dc09d6.xml (deflated 99%) 2025-12-04T11:10:48.7726234Z adding: test/test-reports/python-pytest/dynamo.test_deque_reconstruct/dynamo.test_deque_reconstruct-a848d884f2f29080.xml (deflated 28%) 2025-12-04T11:10:48.7727555Z adding: test/test-reports/python-pytest/inductor.test_utils/inductor.test_utils-bd079476913ef2e3.xml (deflated 28%) 2025-12-04T11:10:48.7728717Z adding: test/test-reports/python-pytest/inductor.test_indexing/inductor.test_indexing-fee711b29eff954c.xml (deflated 28%) 2025-12-04T11:10:48.7729769Z adding: test/test-reports/python-pytest/inductor.test_inductor_annotations/inductor.test_inductor_annotations-f4480b8427e264b4.xml (deflated 28%) 2025-12-04T11:10:48.7730842Z adding: test/test-reports/python-pytest/inductor.test_compile_worker/inductor.test_compile_worker-3d3c1537308173d9.xml (deflated 28%) 2025-12-04T11:10:48.7732033Z adding: test/test-reports/python-pytest/export.test_serialize/export.test_serialize-76457f9e334c8675.xml (deflated 28%) 2025-12-04T11:10:48.7733344Z adding: 
test/test-reports/python-pytest/export.test_export_strict/export.test_export_strict-3ce7562e53da487b.xml (deflated 28%) 2025-12-04T11:10:48.7734968Z adding: test/test-reports/python-pytest/dynamo.test_buffers_override/dynamo.test_buffers_override-bcc16daa85498c4b.xml (deflated 28%) 2025-12-04T11:10:48.7736429Z adding: test/test-reports/python-pytest/inductor.test_split_cat_fx_passes/inductor.test_split_cat_fx_passes-02502e259211870e.xml (deflated 28%) 2025-12-04T11:10:48.7737607Z adding: test/test-reports/python-pytest/inductor.test_cache/inductor.test_cache-258ddb4609e595e9.xml (deflated 28%) 2025-12-04T11:10:48.7738991Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-fd83be05005cff21.xml (deflated 28%) 2025-12-04T11:10:48.7740315Z adding: test/test-reports/python-pytest/inductor.test_control_flow/inductor.test_control_flow-9924173763b05f6a.xml (deflated 28%) 2025-12-04T11:10:48.7741262Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-9bd31207a046932a.xml (deflated 28%) 2025-12-04T11:10:48.7742080Z adding: test/test-reports/python-pytest/test_foreach/test_foreach-57977e6bb7997bc3.xml (deflated 98%) 2025-12-04T11:10:48.7742947Z adding: test/test-reports/python-pytest/nn.test_packed_sequence/nn.test_packed_sequence-dcd68b3499aab515.xml (deflated 28%) 2025-12-04T11:10:48.7743844Z adding: test/test-reports/python-pytest/test_numa_binding/test_numa_binding-62dfaebcfc08487c.xml (deflated 28%) 2025-12-04T11:10:48.7744835Z adding: test/test-reports/python-pytest/test_pruning_op/test_pruning_op-d12712c8cd1c0003.xml (deflated 28%) 2025-12-04T11:10:48.7745637Z adding: test/test-reports/python-pytest/test_jit_fuser_te/test_jit_fuser_te-0670cf00619133fb.xml (deflated 98%) 2025-12-04T11:10:48.7746661Z adding: test/test-reports/python-pytest/torch_np.numpy_tests.core.test_indexing/torch_np.numpy_tests.core.test_indexing-f6308bf864aa318d.xml (deflated 28%) 2025-12-04T11:10:48.7747633Z 
adding: test/test-reports/python-pytest/test_futures/test_futures-eeccb63775c4add0.xml (deflated 28%) 2025-12-04T11:10:48.7748509Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-d21bc49881a2e2b0.xml (deflated 28%) 2025-12-04T11:10:48.7749461Z adding: test/test-reports/python-pytest/test_scaled_matmul_cuda/test_scaled_matmul_cuda-d9bde1d5292f9755.xml (deflated 98%) 2025-12-04T11:10:48.7750523Z adding: test/test-reports/python-pytest/torch_np.numpy_tests.core.test_shape_base/torch_np.numpy_tests.core.test_shape_base-5523c0bbcadda606.xml (deflated 28%) 2025-12-04T11:10:48.7751840Z adding: test/test-reports/python-pytest/test_vulkan/test_vulkan-b9421a80edb6f140.xml (deflated 29%) 2025-12-04T11:10:48.7752677Z adding: test/test-reports/python-pytest/lazy.test_generator/lazy.test_generator-6425516431ce47a1.xml (deflated 28%) 2025-12-04T11:10:48.7753551Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-b82571327b265a63.xml (deflated 28%) 2025-12-04T11:10:48.7799611Z ##[group]Run # Remove any previous usage logs if they exist 2025-12-04T11:10:48.7800147Z # Remove any previous usage logs if they exist 2025-12-04T11:10:48.7800836Z rm -f logs-*.zip 2025-12-04T11:10:48.7801178Z zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt' || true 2025-12-04T11:10:48.7801666Z zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log' || true 2025-12-04T11:10:48.7810795Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:48.7811159Z env: 2025-12-04T11:10:48.7811372Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.7811630Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.7811937Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:48.7812453Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.7812918Z DEVICE_NAME: 2025-12-04T11:10:48.7813132Z DEVICE_TYPE: 2025-12-04T11:10:48.7813483Z FILE_SUFFIX: 
test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T11:10:48.7813977Z ##[endgroup] 2025-12-04T11:10:48.7880674Z adding: usage_log.txt (deflated 58%) 2025-12-04T11:10:48.7915988Z adding: test/test-reports/inductor.test_aot_inductor_5.5_cecd223de7bcc4ce_.log (deflated 92%) 2025-12-04T11:10:48.7916806Z adding: test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_4.4_aaf0e808e3c0e61a_.log (deflated 53%) 2025-12-04T11:10:48.7918325Z adding: test/test-reports/inductor.test_torchinductor_opinfo_7.14_a33a8c2b64197fdb_.log (deflated 96%) 2025-12-04T11:10:48.8388541Z adding: test/test-reports/inductor.test_pattern_matcher_1.1_71c2676cd32e51e5_.log (deflated 96%) 2025-12-04T11:10:48.8391309Z adding: test/test-reports/inductor.test_cuda_repro_1.1_6d15300668add3fc_.log (deflated 98%) 2025-12-04T11:10:48.8392905Z adding: test/test-reports/dynamo.test_activation_checkpointing_1.1_3779be9d1a103562_.log (deflated 51%) 2025-12-04T11:10:48.8394279Z adding: test/test-reports/dynamo.test_logging_1.1_30144991ab7ca96e_.log (deflated 49%) 2025-12-04T11:10:48.8395569Z adding: test/test-reports/dynamo.test_repros_1.1_3a07b22e3f77e1e4_.log (deflated 97%) 2025-12-04T11:10:48.8396375Z adding: test/test-reports/inductor.test_flex_attention_2.6_90fdd3b7d65ce4b2_.log (deflated 50%) 2025-12-04T11:10:48.8397116Z adding: test/test-reports/inductor.test_flex_decoding_2.3_729d33e033147be3_.log (deflated 50%) 2025-12-04T11:10:48.8402545Z adding: test/test-reports/dynamo.test_fx_graph_runnable_1.1_ae8ca2ee8e2c6bb8_.log (deflated 96%) 2025-12-04T11:10:48.8403426Z adding: test/test-reports/inductor.test_online_softmax_1.1_40e23bca08e33227_.log (deflated 50%) 2025-12-04T11:10:48.8404170Z adding: test/test-reports/inductor.test_memory_1.1_4f8d5ba8b79fb015_.log (deflated 49%) 2025-12-04T11:10:48.8404916Z adding: test/test-reports/dynamo.test_streams_1.1_d5d585ff08c4417f_.log (deflated 49%) 2025-12-04T11:10:48.8405705Z adding: 
test/test-reports/inductor.test_unbacked_symints_1.1_0cb6fcdf41989d7a_.log (deflated 51%) 2025-12-04T11:10:48.8406430Z adding: test/test-reports/dynamo.test_aot_compile_1.1_2f925d88d81eb066_.log (deflated 51%) 2025-12-04T11:10:48.8407135Z adding: test/test-reports/test_privateuseone_python_backend_1.1_38648deebb02f255_.log (deflated 51%) 2025-12-04T11:10:48.8407841Z adding: test/test-reports/test_varlen_attention_1.1_ca1d0403b32fd758_.log (deflated 49%) 2025-12-04T11:10:48.8409896Z adding: test/test-reports/test_autograd_1.1_42a7f311e778fd6d_.log (deflated 98%) 2025-12-04T11:10:48.8411011Z adding: test/test-reports/test_ops_fwd_gradients_7.12_01dcdd6b3df592a7_.log (deflated 95%) 2025-12-04T11:10:48.8412019Z adding: test/test-reports/test_ops_gradients_3.10_eaaa384ce8d837b0_.log (deflated 49%) 2025-12-04T11:10:48.8412650Z adding: test/test-reports/test_nestedtensor_1.4_e15b8d770f04fe67_.log (deflated 49%) 2025-12-04T11:10:48.8413397Z adding: test/test-reports/test_sparse_csr_2.2_8d87173a8ebf96a6_.log (deflated 95%) 2025-12-04T11:10:48.8417732Z adding: test/test-reports/test_overrides_1.1_7d80d054d5d65ec3_.log (deflated 98%) 2025-12-04T11:10:48.8418474Z adding: test/test-reports/test_torchfuzz_repros_1.1_fcf26854fba7f69d_.log (deflated 49%) 2025-12-04T11:10:48.8419301Z adding: test/test-reports/inductor.test_group_batch_fusion_1.1_fc8633d8b0944f89_.log (deflated 73%) 2025-12-04T11:10:48.8420109Z adding: test/test-reports/dynamo.test_dynamic_shapes_1.1_d62bd541228f238b_.log (deflated 50%) 2025-12-04T11:10:48.8420811Z adding: test/test-reports/inductor.test_custom_lowering_1.1_4af35be031667708_.log (deflated 51%) 2025-12-04T11:10:48.8421650Z adding: test/test-reports/inductor.test_perf_1.1_4a62740f7bb60b7e_.log (deflated 49%) 2025-12-04T11:10:48.8422401Z adding: test/test-reports/inductor.test_mkldnn_pattern_matcher_1.2_ef762047374873df_.log (deflated 94%) 2025-12-04T11:10:48.8423193Z adding: test/test-reports/inductor.test_cpu_cpp_wrapper_1.1_5bf9de19d872c7a4_.log 
(stored 0%) 2025-12-04T11:10:48.8424027Z adding: test/test-reports/dynamo.test_deque_reconstruct_1.1_90d10d742de65891_.log (deflated 50%) 2025-12-04T11:10:48.8424926Z adding: test/test-reports/inductor.test_utils_1.1_e3f0ace05e84ce51_.log (deflated 49%) 2025-12-04T11:10:48.8425695Z adding: test/test-reports/inductor.test_indexing_1.1_29ca32c31300a4db_.log (deflated 50%) 2025-12-04T11:10:48.8426488Z adding: test/test-reports/inductor.test_inductor_annotations_1.1_6c00e9799b0015c9_.log (deflated 51%) 2025-12-04T11:10:48.8427342Z adding: test/test-reports/inductor.test_compile_worker_1.1_98dfa7ecc1e7975c_.log (deflated 50%) 2025-12-04T11:10:48.8428103Z adding: test/test-reports/export.test_serialize_1.1_91837aecdd2957b9_.log (deflated 49%) 2025-12-04T11:10:48.8428827Z adding: test/test-reports/export.test_export_strict_1.1_be0a1028c7ae3647_.log (deflated 50%) 2025-12-04T11:10:48.8429596Z adding: test/test-reports/dynamo.test_buffers_override_1.1_2b8279aa55d222a3_.log (deflated 50%) 2025-12-04T11:10:48.8430306Z adding: test/test-reports/inductor.test_split_cat_fx_passes_1.1_e5f4e0f1d101e9e4_.log (deflated 51%) 2025-12-04T11:10:48.8430987Z adding: test/test-reports/inductor.test_cache_1.1_eb92e1345f8620a8_.log (deflated 49%) 2025-12-04T11:10:48.8431657Z adding: test/test-reports/inductor.test_aot_inductor_utils_1.1_9a97251034915f58_.log (deflated 51%) 2025-12-04T11:10:48.8432360Z adding: test/test-reports/inductor.test_control_flow_3.4_c7b62bd639790f0d_.log (deflated 50%) 2025-12-04T11:10:48.8433184Z adding: test/test-reports/test_cpp_api_parity_1.1_9be5968028a54dc2_.log (deflated 49%) 2025-12-04T11:10:48.8433966Z adding: test/test-reports/test_foreach_2.2_7bef0d80f11b45cf_.log (deflated 97%) 2025-12-04T11:10:48.8434657Z adding: test/test-reports/nn.test_packed_sequence_1.1_be9e7691d673e5c2_.log (deflated 50%) 2025-12-04T11:10:48.8435334Z adding: test/test-reports/test_numa_binding_1.1_c202ae5c6ae9ae6d_.log (deflated 49%) 2025-12-04T11:10:48.8435933Z adding: 
test/test-reports/test_pruning_op_1.1_fe9ee687d0fcc012_.log (deflated 49%) 2025-12-04T11:10:48.8437053Z adding: test/test-reports/test_jit_fuser_te_1.1_fa00452da479ebdd_.log (deflated 96%) 2025-12-04T11:10:48.8437761Z adding: test/test-reports/optim.test_lrscheduler_1.1_1d3afec10ecbb641_.log (deflated 7%) 2025-12-04T11:10:48.8438619Z adding: test/test-reports/torch_np.numpy_tests.core.test_indexing_1.1_d443f17b87365435_.log (deflated 52%) 2025-12-04T11:10:48.8439312Z adding: test/test-reports/test_futures_1.1_8685b00496b368c3_.log (deflated 48%) 2025-12-04T11:10:48.8440114Z adding: test/test-reports/test_tensor_creation_ops_1.1_0beed75c1e86aa59_.log (deflated 50%) 2025-12-04T11:10:48.8441108Z adding: test/test-reports/test_scaled_matmul_cuda_1.1_dada766423d50405_.log (deflated 95%) 2025-12-04T11:10:48.8441926Z adding: test/test-reports/torch_np.numpy_tests.core.test_shape_base_1.1_9d39f3ab13cf756c_.log (deflated 52%) 2025-12-04T11:10:48.8442689Z adding: test/test-reports/test_vulkan_1.1_a3b26789c3daa3f3_.log (deflated 48%) 2025-12-04T11:10:48.8443275Z adding: test/test-reports/lazy.test_generator_1.1_2f4441dec655dc71_.log (deflated 49%) 2025-12-04T11:10:48.8443886Z adding: test/test-reports/nn.test_convolution_1.2_1888800342d74d79_.log (deflated 49%) 2025-12-04T11:10:48.8476665Z ##[group]Run # Remove any previous debugging artifacts if they exist 2025-12-04T11:10:48.8477373Z # Remove any previous debugging artifacts if they exist 2025-12-04T11:10:48.8477764Z rm -f debug-*.zip 2025-12-04T11:10:48.8478036Z if [ -d 'test/debug' ]; then 2025-12-04T11:10:48.8478381Z  zip -r "debug-${FILE_SUFFIX}.zip" test/debug 2025-12-04T11:10:48.8478700Z fi 2025-12-04T11:10:48.8487268Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:48.8487624Z env: 2025-12-04T11:10:48.8487829Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.8488089Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.8488388Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 
2025-12-04T11:10:48.8488896Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.8489358Z DEVICE_NAME: 2025-12-04T11:10:48.8489631Z DEVICE_TYPE: 2025-12-04T11:10:48.8489971Z FILE_SUFFIX: test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207 2025-12-04T11:10:48.8490379Z ##[endgroup] 2025-12-04T11:10:48.8599151Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T11:10:48.8599469Z with: 2025-12-04T11:10:48.8599740Z s3-bucket: gha-artifacts 2025-12-04T11:10:48.8600055Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:48.8600629Z retention-days: 14 2025-12-04T11:10:48.8600876Z if-no-files-found: warn 2025-12-04T11:10:48.8601146Z path: test-jsons-*.zip 2025-12-04T11:10:48.8601387Z name: artifact 2025-12-04T11:10:48.8601608Z region: us-east-1 2025-12-04T11:10:48.8601829Z env: 2025-12-04T11:10:48.8602029Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:48.8602287Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:48.8602590Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:48.8603113Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:48.8603573Z DEVICE_NAME: 2025-12-04T11:10:48.8603794Z DEVICE_TYPE: 2025-12-04T11:10:48.8604055Z ##[endgroup] 2025-12-04T11:10:49.4024419Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T11:10:49.4024864Z With the provided path, there will be 1 file uploaded 2025-12-04T11:10:49.4025290Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:49.4098513Z Starting upload of test-jsons-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:49.5598711Z Finished upload of test-jsons-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:49.5924758Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T11:10:49.5925112Z with: 2025-12-04T11:10:49.5925348Z s3-bucket: gha-artifacts 2025-12-04T11:10:49.5925901Z s3-prefix: 
pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:49.5926241Z retention-days: 14 2025-12-04T11:10:49.5926483Z if-no-files-found: error 2025-12-04T11:10:49.5926758Z path: test-reports-*.zip 2025-12-04T11:10:49.5927009Z name: artifact 2025-12-04T11:10:49.5927223Z region: us-east-1 2025-12-04T11:10:49.5927439Z env: 2025-12-04T11:10:49.5927646Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:49.5927890Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:49.5928197Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:49.5928718Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:49.5929185Z DEVICE_NAME: 2025-12-04T11:10:49.5929397Z DEVICE_TYPE: 2025-12-04T11:10:49.5929871Z ##[endgroup] 2025-12-04T11:10:50.1797465Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T11:10:50.1798292Z With the provided path, there will be 1 file uploaded 2025-12-04T11:10:50.1799129Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:50.1871661Z Starting upload of test-reports-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:50.3623187Z Finished upload of test-reports-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:50.3928189Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T11:10:50.3928504Z with: 2025-12-04T11:10:50.3928723Z s3-bucket: gha-artifacts 2025-12-04T11:10:50.3929033Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:50.3929362Z retention-days: 14 2025-12-04T11:10:50.3929634Z if-no-files-found: ignore 2025-12-04T11:10:50.3929893Z path: logs-*.zip 2025-12-04T11:10:50.3930115Z name: artifact 2025-12-04T11:10:50.3930345Z region: us-east-1 2025-12-04T11:10:50.3930576Z env: 2025-12-04T11:10:50.3930784Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:50.3931047Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:50.3931359Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:50.3931885Z 
DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:50.3932349Z DEVICE_NAME: 2025-12-04T11:10:50.3932567Z DEVICE_TYPE: 2025-12-04T11:10:50.3932913Z ##[endgroup] 2025-12-04T11:10:50.7320604Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T11:10:50.7321038Z With the provided path, there will be 1 file uploaded 2025-12-04T11:10:50.7321463Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:50.7394000Z Starting upload of logs-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:50.8634421Z Finished upload of logs-test-default-5-8-linux.g5.4xlarge.nvidia.gpu_57118183207.zip 2025-12-04T11:10:50.8932842Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T11:10:50.8933181Z with: 2025-12-04T11:10:50.8933402Z s3-bucket: gha-artifacts 2025-12-04T11:10:50.8933716Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T11:10:50.8934046Z retention-days: 14 2025-12-04T11:10:50.8934302Z if-no-files-found: ignore 2025-12-04T11:10:50.8934565Z path: debug-*.zip 2025-12-04T11:10:50.8934795Z name: artifact 2025-12-04T11:10:50.8935058Z region: us-east-1 2025-12-04T11:10:50.8935283Z env: 2025-12-04T11:10:50.8935489Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:50.8935749Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:50.8936059Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:50.8936599Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:50.8937056Z DEVICE_NAME: 2025-12-04T11:10:50.8937413Z DEVICE_TYPE: 2025-12-04T11:10:50.8937640Z ##[endgroup] 2025-12-04T11:10:51.2233517Z No files were found with the provided path: debug-*.zip. No artifacts will be uploaded. 2025-12-04T11:10:51.2527893Z ##[group]Run # shellcheck disable=SC2156 2025-12-04T11:10:51.2528256Z # shellcheck disable=SC2156 2025-12-04T11:10:51.2528812Z find . 
-iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2025-12-04T11:10:51.2538906Z shell: /usr/bin/bash -e {0} 2025-12-04T11:10:51.2539169Z env: 2025-12-04T11:10:51.2539391Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:51.2539653Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:51.2539957Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:51.2540482Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:51.2540948Z DEVICE_NAME: 2025-12-04T11:10:51.2541163Z DEVICE_TYPE: 2025-12-04T11:10:51.2541402Z ##[endgroup] 2025-12-04T11:10:51.6578653Z Prepare all required actions 2025-12-04T11:10:51.6579025Z Getting action download info 2025-12-04T11:10:51.8593197Z Download action repository 'actions/setup-python@v6' (SHA:83679a892e2d95755f2dac6acb0bfd1e9ac5d548) 2025-12-04T11:10:53.5176483Z ##[group]Run ./.github/actions/upload-utilization-stats 2025-12-04T11:10:53.5176844Z with: 2025-12-04T11:10:53.5177047Z job_id: 57118183207 2025-12-04T11:10:53.5177707Z job_name: linux-jammy-cuda12.8-py3-gcc11-slow-gradcheck / test (default, 5, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck, rerun_disabled_tests) 2025-12-04T11:10:53.5178423Z workflow_name: periodic 2025-12-04T11:10:53.5178688Z workflow_run_id: 19922826259 2025-12-04T11:10:53.5178953Z workflow_attempt: 1 2025-12-04T11:10:53.5179188Z env: 2025-12-04T11:10:53.5179399Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:53.5179653Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:53.5179961Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:53.5180510Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:53.5180988Z DEVICE_NAME: 2025-12-04T11:10:53.5181212Z DEVICE_TYPE: 2025-12-04T11:10:53.5181431Z ##[endgroup] 2025-12-04T11:10:53.5288079Z ##[group]Run actions/setup-python@v6 2025-12-04T11:10:53.5288533Z with: 2025-12-04T11:10:53.5288863Z python-version: 3.10 
2025-12-04T11:10:53.5289237Z check-latest: false 2025-12-04T11:10:53.5289778Z token: *** 2025-12-04T11:10:53.5290121Z update-environment: true 2025-12-04T11:10:53.5290539Z allow-prereleases: false 2025-12-04T11:10:53.5290940Z freethreaded: false 2025-12-04T11:10:53.5291467Z env: 2025-12-04T11:10:53.5291787Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:53.5292187Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:53.5292662Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:53.5293518Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:53.5294288Z DEVICE_NAME: 2025-12-04T11:10:53.5294670Z DEVICE_TYPE: 2025-12-04T11:10:53.5295006Z ##[endgroup] 2025-12-04T11:10:53.8388872Z ##[group]Installed versions 2025-12-04T11:10:53.8398594Z Version 3.10 was not found in the local cache 2025-12-04T11:10:53.8591002Z (node:238802) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T11:10:53.8591766Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T11:10:54.2985759Z ##[error]The version '3.10' with architecture 'x64' was not found for this operating system. 
The list of all available versions can be found here: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json 2025-12-04T11:10:54.3185019Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-12-04T11:10:54.3185422Z with: 2025-12-04T11:10:54.3185620Z env: 2025-12-04T11:10:54.3185856Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:54.3186131Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:54.3186432Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:54.3187063Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:54.3187526Z DEVICE_NAME: 2025-12-04T11:10:54.3187756Z DEVICE_TYPE: 2025-12-04T11:10:54.3187969Z ##[endgroup] 2025-12-04T11:10:54.3246306Z ##[group]Run set -eou pipefail 2025-12-04T11:10:54.3246603Z set -eou pipefail 2025-12-04T11:10:54.3246848Z  2025-12-04T11:10:54.3247196Z echo "Holding runner for 2 hours until all ssh sessions have logged out" 2025-12-04T11:10:54.3247629Z for _ in $(seq 1440); do 2025-12-04T11:10:54.3247971Z  # Break if no ssh session exists anymore 2025-12-04T11:10:54.3248297Z  if [ "$(who)" = "" ]; then 2025-12-04T11:10:54.3248577Z  break 2025-12-04T11:10:54.3248799Z  fi 2025-12-04T11:10:54.3249019Z  echo "." 
2025-12-04T11:10:54.3249241Z  sleep 5 2025-12-04T11:10:54.3249468Z done 2025-12-04T11:10:54.3258486Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:54.3258837Z env: 2025-12-04T11:10:54.3259044Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:54.3259296Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:54.3259595Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:54.3260117Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:54.3260584Z DEVICE_NAME: 2025-12-04T11:10:54.3260795Z DEVICE_TYPE: 2025-12-04T11:10:54.3261017Z ##[endgroup] 2025-12-04T11:10:54.3292458Z Holding runner for 2 hours until all ssh sessions have logged out 2025-12-04T11:10:54.3799215Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T11:10:54.3799829Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T11:10:54.3800409Z # shellcheck disable=SC2046 2025-12-04T11:10:54.3800736Z docker stop $(docker ps -q) || true 2025-12-04T11:10:54.3801080Z # Prune all of the docker images 2025-12-04T11:10:54.3801392Z docker system prune -af 2025-12-04T11:10:54.3810569Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:10:54.3811087Z env: 2025-12-04T11:10:54.3811362Z GIT_DEFAULT_BRANCH: main 2025-12-04T11:10:54.3811628Z HAS_NVIDIA_GPU: true 2025-12-04T11:10:54.3811928Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T11:10:54.3812444Z DOCKER_CONTAINER_ID: 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:10:54.3812911Z DEVICE_NAME: 2025-12-04T11:10:54.3813255Z DEVICE_TYPE: 2025-12-04T11:10:54.3813467Z ##[endgroup] 2025-12-04T11:11:10.6498796Z 2719baa28228 2025-12-04T11:11:11.9370209Z Deleted Containers: 2025-12-04T11:11:11.9370703Z 2719baa282289aab254b6b7924d50238356990eddd907f5b84dd297bc3e312ea 2025-12-04T11:11:11.9371013Z 2025-12-04T11:11:26.9119554Z Deleted Images: 2025-12-04T11:11:26.9120445Z untagged: 
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T11:11:26.9121675Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image@sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T11:11:26.9122519Z deleted: sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301 2025-12-04T11:11:26.9123125Z deleted: sha256:85a76b7bf29ad34eb76cce6f46af5d49a58b6272f80f983d5c769e82c7749301 2025-12-04T11:11:26.9123727Z deleted: sha256:0882f3ce59ff5ae30195ee4b059fc713e13eda107a3a7814a4616ac9058a30a4 2025-12-04T11:11:26.9124661Z deleted: sha256:64ba5b9344c11a3e4729136076830b90ac4cf1554046edb1bd4f0784b66ebd9b 2025-12-04T11:11:26.9125258Z deleted: sha256:88213c59cf461a65ab9b6cb07b4195dc9d41b5241c152daa002c7b3112e09124 2025-12-04T11:11:26.9125868Z deleted: sha256:4c0f83afa802ffbc05ebaf1aa50e48a2447c7c295549a6dded80ac63437906ca 2025-12-04T11:11:26.9126484Z deleted: sha256:6f7ec74460e8fb070c8209949095ea3be5f4e2fd69c9f750cd39ac4093f5e64b 2025-12-04T11:11:26.9127167Z deleted: sha256:d6928b0d1021b31942fdcb64e5eb4a34682de66e959dd424ed6ed02c29cd706d 2025-12-04T11:11:26.9127771Z deleted: sha256:4e9fbcb1705a6351bb34dd320558752614308636b94fd9ae6f26063e3deadc0a 2025-12-04T11:11:26.9128369Z deleted: sha256:43aabd0201f48712f21758071352dea029b4de37be08b2e2197706856a9ecbf2 2025-12-04T11:11:26.9128952Z deleted: sha256:940a98dec78303f0548beb1033242a45e9097607ef3e55c8b949b69b73d1b95e 2025-12-04T11:11:26.9129551Z deleted: sha256:d2849fa0e0411cf66e4408831d70e38838afb55b11a80c1c4d8aa0ae7dc9ca40 2025-12-04T11:11:26.9130146Z deleted: sha256:14f40d23c20c7e562623f89deb376520296758bc39dd3c77284049b84ebd8a31 2025-12-04T11:11:26.9130767Z deleted: sha256:a8ccba61f90ca097cb391d0f4fbed0d9f821d06b00e28f7332e9e2dcfcbac4ca 2025-12-04T11:11:26.9131362Z deleted: sha256:91b2060d290547d3b517d4a11d994bbe23f4560b5546cb91918ca1828dde6be1 
2025-12-04T11:11:26.9131964Z deleted: sha256:b42a184755715dcfead7fad655a127433541d316d9628f5f730ff17ad5f8071c 2025-12-04T11:11:26.9132573Z deleted: sha256:aa5b4f3c9169061dc3c6da0e677e8a86f11ecb0a3f9fb4861ab3d8c04379775c 2025-12-04T11:11:26.9133176Z deleted: sha256:b4dcf450081a48d77fea0a21b8d810a69c03608a595e754fe7d365058d0579b7 2025-12-04T11:11:26.9133779Z deleted: sha256:4f7fe12d3d4f5bf890c7ada4ce16f17a105472aa6509a778f917dcce2f28174b 2025-12-04T11:11:26.9134392Z deleted: sha256:2d1d5a74182594f9a8553df00fdcfc809dba407bcd6700d667f862cbe9d555ce 2025-12-04T11:11:26.9135063Z deleted: sha256:d901e2f5d449aeed16b727bdcc11fc0e0f6c30c8fc5c39ac7eeac8a74d9d176c 2025-12-04T11:11:26.9135660Z deleted: sha256:a04df2603bd12372c6632469a9a81ebc4a8d677452c250672b9692884fa6a452 2025-12-04T11:11:26.9136260Z deleted: sha256:f438a6b52273a552dc3820a55c74c53a62a0eae9f2a7d21b37125add7d71639f 2025-12-04T11:11:26.9136860Z deleted: sha256:d4b09517e9518d709ac98b0ae6f8446ec9ac51688253607b1fca67aa2c87b3f4 2025-12-04T11:11:26.9137466Z deleted: sha256:c1fa38335237f5e7263e39d3d3de98215bcfbbb12b826955c02e149bf68efd13 2025-12-04T11:11:26.9138059Z deleted: sha256:c898d20a30de901fca74d7611663b17ab48e1726a11e031e40548ed16ee81877 2025-12-04T11:11:26.9140600Z deleted: sha256:3baceec7096518fcc10696feba551639d698b3145c2fc09cac927bb60c0fd751 2025-12-04T11:11:26.9141203Z deleted: sha256:5245aaaa3d5c3a19f76b9a6c920bd82d1a0ff5289f87c8c109652089709d9b3b 2025-12-04T11:11:26.9141792Z deleted: sha256:f05cc789b95246938c377f474c41187965b89ceac0250e7d5124bec32153f447 2025-12-04T11:11:26.9142376Z deleted: sha256:07ec4fc008de4e7a2c794ec7094cc72e0d287c04c8b2156163aee0bae147fe2d 2025-12-04T11:11:26.9142976Z deleted: sha256:c6302601ad5fde573c1f8c900250478fca7fdc6907d8fd4fae651b94b4d9264d 2025-12-04T11:11:26.9143712Z deleted: sha256:cc5e955ee1dc54931f02606c5ea87aae14f03b5d764492be611480ab041f2882 2025-12-04T11:11:26.9144304Z deleted: sha256:f21c03518996d98452338f4e80bcfd9b139a1dab155f4830be0d3f623035269f 
2025-12-04T11:11:26.9145141Z deleted: sha256:519ca6f1279f7886f25f0005527cfa627deebbc5b7d7cdbfa7ef962bcfc4c26d 2025-12-04T11:11:26.9145788Z deleted: sha256:0ef990495216807d0175b192045be3f617e72331bc373b3434807f41bf69168d 2025-12-04T11:11:26.9146379Z deleted: sha256:7093edf7319e1f0e01654c3224e32c8dede5b948d106e0b9b03cbf0bb1091e33 2025-12-04T11:11:26.9146968Z deleted: sha256:c478161e058e2f4041555c3e880b95ee1ee047938dc58549a3a88135740996ae 2025-12-04T11:11:26.9147556Z deleted: sha256:9bb853b0d938cd7c36a80ce8ee40653f2c0ff92719209b11beb03acc8855ce3e 2025-12-04T11:11:26.9148161Z deleted: sha256:fdf2ace71a78ce6910ef9c4b073c195531da47022443b606bb92dcd6499b6afc 2025-12-04T11:11:26.9148763Z deleted: sha256:576c2b3770d871937d3cfb7014328bcb4bd1aed0c28bc438764b3bfdac4c1ac2 2025-12-04T11:11:26.9149363Z deleted: sha256:878e92b9cb82de09ac14a9d5f3f7bc2411a799b6f54d0d64b78c2bb4d1fdc0fc 2025-12-04T11:11:26.9150087Z deleted: sha256:85c8c3b98b65a6695f988a10cc66c981d73a3ef03eda15b8e14d227b50b56300 2025-12-04T11:11:26.9150702Z deleted: sha256:ce2ab3ba07794f9ee95d6ea7de6dcd3d2aed96561f9a79192dd56ca5bf29313a 2025-12-04T11:11:26.9151310Z deleted: sha256:37a6e12976ca957286977e696e63012ab9821214b0483fe1a48d29dcb280508a 2025-12-04T11:11:26.9151945Z deleted: sha256:cd1d5d3dd7038144ca6fe961c0d4c8e705625ae0c36190ba8b3e9602abedad19 2025-12-04T11:11:26.9152541Z deleted: sha256:0e707276e0be2e0008b86d594fadc0d16444d66c4fb7227c56f144cbb3c2affd 2025-12-04T11:11:26.9153143Z deleted: sha256:22d4aad6a2ada91b341c1225a0f314042b8aeabef7568c5c019709b058bf070b 2025-12-04T11:11:26.9153751Z deleted: sha256:ee4adacf4e0933131d0275eddad406b3c8147e6cf07a292b99f1aff4b5355f33 2025-12-04T11:11:26.9154358Z deleted: sha256:43da0b9e7c0e18403dcb834e53628dc7c970ccb2dbd091878c0d7c0170dbc97f 2025-12-04T11:11:26.9154963Z deleted: sha256:00571684bdcd75beda15eb7d4e79b5458bc914350f9bb4d87fcdc97ad15e0da1 2025-12-04T11:11:26.9155570Z deleted: sha256:41615f09950259f1d75e82ef35b6fc53b18fe71ebff143744cfd51009d04349e 
2025-12-04T11:11:26.9156163Z deleted: sha256:75ab34d2eed3c7915467a506ab6dab2711918fbabe94add2fb5c62780221ab0c 2025-12-04T11:11:26.9156772Z deleted: sha256:0a39ef2bebf44c1c3893d1e5fb42dad48b8fac7ca673141267ee967f85455e89 2025-12-04T11:11:26.9157383Z deleted: sha256:9b7d024e48ba1f9824a54597621b1b062cbc4aa41a77d81ca538d6b5c24a612c 2025-12-04T11:11:26.9157976Z deleted: sha256:392257172de6434c271bd93394218a91e9aa86d7c18abc2f2759317b9d5fb6de 2025-12-04T11:11:26.9158547Z deleted: sha256:6c3232860b930866a463a356124fc392c7e5f04895695229257e8c3e8a02711d 2025-12-04T11:11:26.9159131Z deleted: sha256:63dd55b807215e2fa6c715419ac0c5072d02dddc848dbf74bb7e77b906b5eaed 2025-12-04T11:11:26.9159802Z deleted: sha256:07a8738c1b4584db72ed9aa60f5274321eb0ba16263450da3a75df8326ebc25f 2025-12-04T11:11:26.9160442Z deleted: sha256:053fe2965b01281d12040ec1893e0d1aa77362a49ea9a1067402272c69dad9f5 2025-12-04T11:11:26.9161046Z deleted: sha256:7857fb5eb181c4e80262ecab60bdd3c266cf3d1409ceb76c05882609b416a8d3 2025-12-04T11:11:26.9161651Z deleted: sha256:752528477fc99089de3bd2c6da7b30cf34f2e901fe06d8fcfe685b411461e883 2025-12-04T11:11:26.9162254Z deleted: sha256:cce0210e2f4b042601813df03aa294a86b0c668fcfc75f4c63f6fa12b2952e15 2025-12-04T11:11:26.9162843Z deleted: sha256:f2bb405a26705ecd12d21380d26d9355d01db3a2175080fbdb468f2b5a25a76c 2025-12-04T11:11:26.9163461Z deleted: sha256:ad430120d4ffbaf97cd8d6de6ea8eefa4a8f80ec45f0b176c6b26bff0970fd33 2025-12-04T11:11:26.9164092Z deleted: sha256:225a4910baea7cc540ed43eeac75046293800ab0b8e0192b51e991c8cb50bcf3 2025-12-04T11:11:26.9164729Z deleted: sha256:a259945b0c3507f049fbac10fb3d3ffe43d45e83c91b80ae8cd1dafb855ad83c 2025-12-04T11:11:26.9165325Z deleted: sha256:862a98881b1d5adad5c21d01602773b894794097de80964ef8f47bcaadb43255 2025-12-04T11:11:26.9165910Z deleted: sha256:1cf6d3c8b6c2694b79a2d08719594903811c330a36a4c7a8a7153a350b53d292 2025-12-04T11:11:26.9166560Z deleted: sha256:232a1ae8b0fee817ff7838bb5986a2f38377d3b1dbbf5217b576af0f953b0844 
2025-12-04T11:11:26.9167157Z deleted: sha256:c72c5705dabd6314423dd7d4fb260a20d5d9886b2ebce60d19e9d78c4a2335c2 2025-12-04T11:11:26.9167745Z deleted: sha256:296734cf81fd92c913884d058908598424ffe072676e38de289bbab83768c7bd 2025-12-04T11:11:26.9168321Z deleted: sha256:7c76040481b889847a1804021aeff07547eaa4ee706d6137db218d497a8fd9c1 2025-12-04T11:11:26.9168917Z deleted: sha256:d5e293f5b354e8cbcc6de893ea72cc632b02d8fdfbb08ec3127c4e9662f3ebff 2025-12-04T11:11:26.9169520Z deleted: sha256:f35a64e429c88e249645090f21fbe7dae108d98e0ab4ea13184f24b3fd66c315 2025-12-04T11:11:26.9170117Z deleted: sha256:ce6ae8d595c8e69115c51b1ce4f9a9158795d7b863b1cb53f21c39a87974d41b 2025-12-04T11:11:26.9170717Z deleted: sha256:8941abaee59400fb9b3a60765fea4a1fc2a6a447467a6d983e84c7f72494a450 2025-12-04T11:11:26.9171310Z deleted: sha256:ef53c29a9a2c2bc80ffdb9bfaf92842436b5755ec1ce828b9d11e5e27d656ea1 2025-12-04T11:11:26.9171923Z deleted: sha256:7a347fb0acb43f1c814f8c8ff21185e8b5cf64d7bc5988cea060f77d906e08b5 2025-12-04T11:11:26.9172618Z deleted: sha256:cc855dc9be79496e15175569dced2d13477e50b077a5fd3945f9bf50018880c1 2025-12-04T11:11:26.9173300Z deleted: sha256:f7a9946ada3d4786658bc0b643808bb32a9a45e4e90e30dc43ee19e2dbe24024 2025-12-04T11:11:26.9173885Z deleted: sha256:c22a9215f62812c1d2e32827f5221ff556c5b6702aadbdab6b87b8293f19635e 2025-12-04T11:11:26.9174518Z deleted: sha256:959a56746620012e37c1def1a83c5afb1e7c0adc59b021a28beb53c24df98032 2025-12-04T11:11:26.9175111Z deleted: sha256:31a0fff0695bf6100c17954be72eab2095b466d559c75c3faf2a17d8c41e6ebe 2025-12-04T11:11:26.9175690Z deleted: sha256:c15e2b5241b9e55af1b2593e544391b4b44d0505e6528e8f12425136e93b424c 2025-12-04T11:11:26.9176278Z deleted: sha256:73974f74b436f39a2fdb6461b1e3f7c3e41c73325776fa71d16b942a5b4a365b 2025-12-04T11:11:26.9176777Z untagged: public.ecr.aws/docker/library/python:3.13 2025-12-04T11:11:26.9177446Z untagged: public.ecr.aws/docker/library/python@sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 
2025-12-04T11:11:26.9178225Z deleted: sha256:44438aecfedf7b6086fce506dae0db5ba7fc0027f9b743f1a75a6b5cbc7de70a 2025-12-04T11:11:26.9178835Z deleted: sha256:6f09a1f5d8a107c2532fbd116e75116cb75fa77b1a7d72d3bdf1ac12de152acd 2025-12-04T11:11:26.9179446Z deleted: sha256:fe5f3ac0be086125eb1e3cd10cc33e8e426f4e079381f7ce5a987b626e99fa67 2025-12-04T11:11:26.9180050Z deleted: sha256:79dd2061a22cf919cfc4f1f02704bfda09afadb017265e670ee54441d296c06c 2025-12-04T11:11:26.9180666Z deleted: sha256:9447ad402aafdbee17e999b0ec84ad89c2646dbebf054d469d4f8bee77f66212 2025-12-04T11:11:26.9181265Z deleted: sha256:7a4909f3c1975be52292f53107495ee1b41c17494918767ccedf1cf1688ae318 2025-12-04T11:11:26.9181845Z deleted: sha256:3474923d97f1f498237650a7d51bd4aea37d5e6b9d8a778777920584af5dd560 2025-12-04T11:11:26.9182432Z deleted: sha256:683afd1773444401a9cbd24842ee5d9154a11abb4fab63ddea5c03df788597ee 2025-12-04T11:11:26.9182793Z 2025-12-04T11:11:26.9182913Z Total reclaimed space: 35.68GB 2025-12-04T11:11:26.9254872Z Post job cleanup. 2025-12-04T11:11:26.9305637Z Post job cleanup. 2025-12-04T11:11:27.0690699Z (node:238960) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T11:11:27.0691432Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T11:11:27.0888576Z Post job cleanup. 2025-12-04T11:11:27.0933199Z Post job cleanup. 
2025-12-04T11:11:27.1918667Z [command]/usr/bin/git version 2025-12-04T11:11:27.1965123Z git version 2.50.1 2025-12-04T11:11:27.2006965Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/050e8d3d-42e0-48db-9c19-88e7d83bf9f0/.gitconfig' 2025-12-04T11:11:27.2018695Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/050e8d3d-42e0-48db-9c19-88e7d83bf9f0' before making global git config changes 2025-12-04T11:11:27.2020105Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T11:11:27.2025273Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T11:11:27.2071119Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T11:11:27.2120013Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T11:11:27.2522271Z Entering 'android/libs/fbjni' 2025-12-04T11:11:27.2602325Z Entering 'third_party/FP16' 2025-12-04T11:11:27.2683246Z Entering 'third_party/FXdiv' 2025-12-04T11:11:27.2763270Z Entering 'third_party/NNPACK' 2025-12-04T11:11:27.2843288Z Entering 'third_party/NVTX' 2025-12-04T11:11:27.2924161Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T11:11:27.3005126Z Entering 'third_party/XNNPACK' 2025-12-04T11:11:27.3099562Z Entering 'third_party/aiter' 2025-12-04T11:11:27.3180503Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T11:11:27.3269764Z Entering 'third_party/benchmark' 2025-12-04T11:11:27.3350304Z Entering 'third_party/composable_kernel' 2025-12-04T11:11:27.3439607Z Entering 'third_party/cpp-httplib' 2025-12-04T11:11:27.3519176Z Entering 'third_party/cpuinfo' 2025-12-04T11:11:27.3597743Z Entering 'third_party/cudnn_frontend' 2025-12-04T11:11:27.3682739Z Entering 'third_party/cutlass' 2025-12-04T11:11:27.3773458Z 
Entering 'third_party/fbgemm' 2025-12-04T11:11:27.3857901Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T11:11:27.3935838Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T11:11:27.4020089Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T11:11:27.4097406Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T11:11:27.4182832Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T11:11:27.4260644Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T11:11:27.4336324Z Entering 'third_party/fbgemm/external/json' 2025-12-04T11:11:27.4419441Z Entering 'third_party/flash-attention' 2025-12-04T11:11:27.4501950Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T11:11:27.4587264Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T11:11:27.4676988Z Entering 'third_party/flatbuffers' 2025-12-04T11:11:27.4761737Z Entering 'third_party/fmt' 2025-12-04T11:11:27.4841290Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T11:11:27.4920951Z Entering 'third_party/gloo' 2025-12-04T11:11:27.5001194Z Entering 'third_party/googletest' 2025-12-04T11:11:27.5080861Z Entering 'third_party/ideep' 2025-12-04T11:11:27.5160828Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T11:11:27.5248018Z Entering 'third_party/ittapi' 2025-12-04T11:11:27.5328668Z Entering 'third_party/kineto' 2025-12-04T11:11:27.5406885Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T11:11:27.5482796Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T11:11:27.5561279Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T11:11:27.5644466Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T11:11:27.5723188Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T11:11:27.5799431Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 
2025-12-04T11:11:27.5881274Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T11:11:27.5964253Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T11:11:27.6043958Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T11:11:27.6124659Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T11:11:27.6204089Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T11:11:27.6280450Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:27.6361319Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T11:11:27.6448205Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T11:11:27.6525292Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T11:11:27.6608597Z Entering 'third_party/kleidiai' 2025-12-04T11:11:27.6689785Z Entering 'third_party/mimalloc' 2025-12-04T11:11:27.6768988Z Entering 'third_party/nlohmann' 2025-12-04T11:11:27.6849984Z Entering 'third_party/onnx' 2025-12-04T11:11:27.6947349Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T11:11:27.7033706Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T11:11:27.7113750Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T11:11:27.7190548Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T11:11:27.7267317Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T11:11:27.7343125Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T11:11:27.7422493Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T11:11:27.7498644Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T11:11:27.7576495Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T11:11:27.7655091Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:27.7734005Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T11:11:27.7816756Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T11:11:27.7916807Z Entering 'third_party/pocketfft' 2025-12-04T11:11:27.7997910Z Entering 'third_party/protobuf' 2025-12-04T11:11:27.8078283Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T11:11:27.8158761Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T11:11:27.8241164Z Entering 'third_party/psimd' 2025-12-04T11:11:27.8319989Z Entering 'third_party/pthreadpool' 2025-12-04T11:11:27.8397705Z Entering 'third_party/pybind11' 2025-12-04T11:11:27.8483256Z Entering 'third_party/python-peachpy' 2025-12-04T11:11:27.8563464Z Entering 'third_party/sleef' 2025-12-04T11:11:27.8643166Z Entering 'third_party/tensorpipe' 2025-12-04T11:11:27.8720797Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T11:11:27.8797480Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T11:11:27.8880823Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T11:11:27.8960775Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T11:11:27.9035360Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T11:11:27.9146607Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T11:11:27.9176151Z http.https://github.com/.extraheader 2025-12-04T11:11:27.9188740Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T11:11:27.9227308Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 
'http.https://github.com/.extraheader' || :" 2025-12-04T11:11:27.9630360Z Entering 'android/libs/fbjni' 2025-12-04T11:11:27.9681510Z http.https://github.com/.extraheader 2025-12-04T11:11:27.9733889Z Entering 'third_party/FP16' 2025-12-04T11:11:27.9788063Z http.https://github.com/.extraheader 2025-12-04T11:11:27.9837312Z Entering 'third_party/FXdiv' 2025-12-04T11:11:27.9887991Z http.https://github.com/.extraheader 2025-12-04T11:11:27.9937809Z Entering 'third_party/NNPACK' 2025-12-04T11:11:27.9988674Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0039416Z Entering 'third_party/NVTX' 2025-12-04T11:11:28.0090692Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0140262Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T11:11:28.0191777Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0241684Z Entering 'third_party/XNNPACK' 2025-12-04T11:11:28.0292514Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0358621Z Entering 'third_party/aiter' 2025-12-04T11:11:28.0414595Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0465188Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T11:11:28.0515998Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0576419Z Entering 'third_party/benchmark' 2025-12-04T11:11:28.0629961Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0679278Z Entering 'third_party/composable_kernel' 2025-12-04T11:11:28.0730634Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0788912Z Entering 'third_party/cpp-httplib' 2025-12-04T11:11:28.0843410Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0892554Z Entering 'third_party/cpuinfo' 2025-12-04T11:11:28.0944595Z http.https://github.com/.extraheader 2025-12-04T11:11:28.0996156Z Entering 'third_party/cudnn_frontend' 2025-12-04T11:11:28.1047124Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1097548Z Entering 'third_party/cutlass' 2025-12-04T11:11:28.1151291Z http.https://github.com/.extraheader 
2025-12-04T11:11:28.1212454Z Entering 'third_party/fbgemm' 2025-12-04T11:11:28.1263861Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1314848Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T11:11:28.1365088Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1415611Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T11:11:28.1466446Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1523351Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T11:11:28.1572693Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1623550Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T11:11:28.1674219Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1733140Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T11:11:28.1784178Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1833221Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T11:11:28.1883679Z http.https://github.com/.extraheader 2025-12-04T11:11:28.1932964Z Entering 'third_party/fbgemm/external/json' 2025-12-04T11:11:28.1983769Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2041134Z Entering 'third_party/flash-attention' 2025-12-04T11:11:28.2092214Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2140338Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T11:11:28.2190093Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2248664Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T11:11:28.2297939Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2360386Z Entering 'third_party/flatbuffers' 2025-12-04T11:11:28.2412010Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2465720Z Entering 'third_party/fmt' 2025-12-04T11:11:28.2518831Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2568297Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T11:11:28.2622381Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2672310Z Entering 
'third_party/gloo' 2025-12-04T11:11:28.2723605Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2772792Z Entering 'third_party/googletest' 2025-12-04T11:11:28.2824024Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2876937Z Entering 'third_party/ideep' 2025-12-04T11:11:28.2929477Z http.https://github.com/.extraheader 2025-12-04T11:11:28.2974920Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T11:11:28.3025888Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3084729Z Entering 'third_party/ittapi' 2025-12-04T11:11:28.3138721Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3188467Z Entering 'third_party/kineto' 2025-12-04T11:11:28.3240475Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3287651Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T11:11:28.3338733Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3388077Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T11:11:28.3438625Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3490210Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T11:11:28.3541186Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3591288Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T11:11:28.3644392Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3693690Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T11:11:28.3743939Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3791551Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T11:11:28.3842310Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3896354Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T11:11:28.3948197Z http.https://github.com/.extraheader 2025-12-04T11:11:28.3999110Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T11:11:28.4050851Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4102196Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T11:11:28.4152762Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4206511Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T11:11:28.4256279Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4306582Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T11:11:28.4357808Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4408742Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:28.4459797Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4513459Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T11:11:28.4568712Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4630489Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T11:11:28.4680640Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4728800Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T11:11:28.4778208Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4831310Z Entering 'third_party/kleidiai' 2025-12-04T11:11:28.4882654Z http.https://github.com/.extraheader 2025-12-04T11:11:28.4933497Z Entering 'third_party/mimalloc' 2025-12-04T11:11:28.4984013Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5033916Z Entering 'third_party/nlohmann' 2025-12-04T11:11:28.5087392Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5141128Z Entering 'third_party/onnx' 2025-12-04T11:11:28.5191776Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5257738Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T11:11:28.5309169Z 
http.https://github.com/.extraheader 2025-12-04T11:11:28.5364682Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T11:11:28.5416435Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5466504Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T11:11:28.5516263Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5569088Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T11:11:28.5620086Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5668094Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T11:11:28.5721404Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5769023Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T11:11:28.5818911Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5870516Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T11:11:28.5923190Z http.https://github.com/.extraheader 2025-12-04T11:11:28.5971867Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T11:11:28.6023294Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6073736Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T11:11:28.6124890Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6171313Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:28.6221357Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6272461Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T11:11:28.6322091Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6374791Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T11:11:28.6425295Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6498507Z Entering 'third_party/pocketfft' 2025-12-04T11:11:28.6550571Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6602280Z Entering 
'third_party/protobuf' 2025-12-04T11:11:28.6653523Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6705480Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T11:11:28.6755943Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6805028Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T11:11:28.6855165Z http.https://github.com/.extraheader 2025-12-04T11:11:28.6909085Z Entering 'third_party/psimd' 2025-12-04T11:11:28.6960875Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7010672Z Entering 'third_party/pthreadpool' 2025-12-04T11:11:28.7062003Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7112546Z Entering 'third_party/pybind11' 2025-12-04T11:11:28.7164181Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7214123Z Entering 'third_party/python-peachpy' 2025-12-04T11:11:28.7264633Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7314076Z Entering 'third_party/sleef' 2025-12-04T11:11:28.7364890Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7415145Z Entering 'third_party/tensorpipe' 2025-12-04T11:11:28.7467120Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7517906Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T11:11:28.7567578Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7616272Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T11:11:28.7667250Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7715327Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T11:11:28.7765496Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7814057Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T11:11:28.7863179Z http.https://github.com/.extraheader 2025-12-04T11:11:28.7910212Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T11:11:28.7962495Z http.https://github.com/.extraheader 2025-12-04T11:11:28.8045136Z [command]/usr/bin/git config --local --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:28.8094189Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T11:11:28.8493527Z Entering 'android/libs/fbjni' 2025-12-04T11:11:28.8529561Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T11:11:28.8553784Z Entering 'third_party/FP16' 2025-12-04T11:11:28.8589634Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T11:11:28.8614588Z Entering 'third_party/FXdiv' 2025-12-04T11:11:28.8649237Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T11:11:28.8674003Z Entering 'third_party/NNPACK' 2025-12-04T11:11:28.8708827Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T11:11:28.8733897Z Entering 'third_party/NVTX' 2025-12-04T11:11:28.8768245Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T11:11:28.8793662Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T11:11:28.8829738Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T11:11:28.8855827Z Entering 'third_party/XNNPACK' 2025-12-04T11:11:28.8891221Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T11:11:28.8932083Z Entering 'third_party/aiter' 2025-12-04T11:11:28.8967261Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T11:11:28.8991358Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T11:11:28.9025199Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T11:11:28.9063449Z Entering 'third_party/benchmark' 2025-12-04T11:11:28.9102048Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T11:11:28.9127689Z Entering 'third_party/composable_kernel' 2025-12-04T11:11:28.9162322Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T11:11:28.9195913Z Entering 'third_party/cpp-httplib' 2025-12-04T11:11:28.9231810Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T11:11:28.9257157Z Entering 'third_party/cpuinfo' 2025-12-04T11:11:28.9291291Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T11:11:28.9317186Z Entering 'third_party/cudnn_frontend' 2025-12-04T11:11:28.9356726Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T11:11:28.9381778Z Entering 'third_party/cutlass' 2025-12-04T11:11:28.9418293Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T11:11:28.9453473Z Entering 'third_party/fbgemm' 2025-12-04T11:11:28.9489406Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T11:11:28.9515199Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T11:11:28.9547864Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T11:11:28.9572440Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T11:11:28.9608697Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T11:11:28.9641762Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T11:11:28.9675517Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T11:11:28.9699775Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T11:11:28.9734869Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T11:11:28.9770889Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T11:11:28.9806938Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T11:11:28.9831641Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T11:11:28.9864655Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T11:11:28.9888572Z Entering 'third_party/fbgemm/external/json' 2025-12-04T11:11:28.9922998Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T11:11:28.9952656Z Entering 'third_party/flash-attention' 2025-12-04T11:11:28.9987863Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T11:11:29.0012797Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T11:11:29.0047959Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T11:11:29.0078805Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T11:11:29.0113049Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T11:11:29.0148955Z Entering 'third_party/flatbuffers' 2025-12-04T11:11:29.0186256Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T11:11:29.0215581Z Entering 'third_party/fmt' 2025-12-04T11:11:29.0250171Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T11:11:29.0275320Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T11:11:29.0311330Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T11:11:29.0336048Z Entering 'third_party/gloo' 2025-12-04T11:11:29.0371701Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T11:11:29.0396989Z Entering 'third_party/googletest' 2025-12-04T11:11:29.0433098Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T11:11:29.0458118Z Entering 'third_party/ideep' 2025-12-04T11:11:29.0495556Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T11:11:29.0518350Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T11:11:29.0553505Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T11:11:29.0587634Z Entering 'third_party/ittapi' 2025-12-04T11:11:29.0623951Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T11:11:29.0648861Z Entering 'third_party/kineto' 2025-12-04T11:11:29.0684072Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config 
remote.origin.url 2025-12-04T11:11:29.0707820Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T11:11:29.0741178Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T11:11:29.0763829Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T11:11:29.0798772Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T11:11:29.0825054Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T11:11:29.0860437Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T11:11:29.0885250Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T11:11:29.0920743Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T11:11:29.0945382Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T11:11:29.0984938Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T11:11:29.1006338Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T11:11:29.1039806Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T11:11:29.1068394Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 
2025-12-04T11:11:29.1105367Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T11:11:29.1129124Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T11:11:29.1166071Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T11:11:29.1190394Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T11:11:29.1225199Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T11:11:29.1252359Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T11:11:29.1286902Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T11:11:29.1312041Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T11:11:29.1346677Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T11:11:29.1369113Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:29.1402970Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T11:11:29.1430473Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 
2025-12-04T11:11:29.1466526Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T11:11:29.1499096Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T11:11:29.1535861Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T11:11:29.1559140Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T11:11:29.1591973Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T11:11:29.1620276Z Entering 'third_party/kleidiai' 2025-12-04T11:11:29.1657816Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T11:11:29.1683905Z Entering 'third_party/mimalloc' 2025-12-04T11:11:29.1719490Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T11:11:29.1745149Z Entering 'third_party/nlohmann' 2025-12-04T11:11:29.1784667Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T11:11:29.1814319Z Entering 'third_party/onnx' 2025-12-04T11:11:29.1849921Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T11:11:29.1891001Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T11:11:29.1925580Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T11:11:29.1956687Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T11:11:29.1992962Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T11:11:29.2017342Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T11:11:29.2051414Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T11:11:29.2076149Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T11:11:29.2110228Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T11:11:29.2134401Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T11:11:29.2168192Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T11:11:29.2192285Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T11:11:29.2226357Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T11:11:29.2252926Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T11:11:29.2285872Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T11:11:29.2309913Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T11:11:29.2343468Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T11:11:29.2372105Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T11:11:29.2405661Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T11:11:29.2427314Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T11:11:29.2460912Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T11:11:29.2486543Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T11:11:29.2520428Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T11:11:29.2547076Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T11:11:29.2594292Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T11:11:29.2632259Z Entering 'third_party/pocketfft' 2025-12-04T11:11:29.2668427Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T11:11:29.2692806Z Entering 'third_party/protobuf' 2025-12-04T11:11:29.2727874Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T11:11:29.2754076Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T11:11:29.2787081Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T11:11:29.2812124Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T11:11:29.2849888Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 
2025-12-04T11:11:29.2878036Z Entering 'third_party/psimd' 2025-12-04T11:11:29.2915363Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T11:11:29.2940256Z Entering 'third_party/pthreadpool' 2025-12-04T11:11:29.2974794Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T11:11:29.2999495Z Entering 'third_party/pybind11' 2025-12-04T11:11:29.3035267Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T11:11:29.3060634Z Entering 'third_party/python-peachpy' 2025-12-04T11:11:29.3097810Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T11:11:29.3123196Z Entering 'third_party/sleef' 2025-12-04T11:11:29.3159304Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T11:11:29.3183898Z Entering 'third_party/tensorpipe' 2025-12-04T11:11:29.3220280Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T11:11:29.3243348Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T11:11:29.3275764Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T11:11:29.3299677Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T11:11:29.3334152Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T11:11:29.3357649Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T11:11:29.3392923Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T11:11:29.3416800Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T11:11:29.3449600Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T11:11:29.3471514Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T11:11:29.3505793Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T11:11:29.3563396Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3600974Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3639766Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3676233Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3709762Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3743423Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3777540Z [command]/usr/bin/git config 
--file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3811860Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3844922Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3877785Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3911507Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3945644Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.3979593Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4014050Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4046604Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4079899Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp 
^includeIf\.gitdir: 2025-12-04T11:11:29.4112792Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4146040Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4178251Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4212217Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4245753Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4278230Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4314965Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4347960Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4380224Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4414485Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4446333Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4480784Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4515019Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4547167Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4576646Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4610884Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4643465Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4676666Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4712750Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4746825Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4780096Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4814902Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4848720Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4883528Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4917119Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4950959Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.4989881Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5026282Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5072786Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5109543Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5142295Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5175401Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5209394Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5244739Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5276416Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5309922Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5343046Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5375745Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5408659Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5443090Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5475718Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5508295Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5540354Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5573182Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5607989Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5642779Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5675287Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5710253Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5746585Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5781724Z [command]/usr/bin/git config 
--file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5823953Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5858053Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5892700Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5927063Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5959947Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.5993557Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6027273Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6061627Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6095663Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6131695Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6165080Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6198657Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6232515Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6270442Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6300918Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T11:11:29.6446987Z A job completed hook has been configured by the self-hosted runner administrator 2025-12-04T11:11:29.6468200Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-12-04T11:11:29.6476373Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T11:11:29.6476754Z ##[endgroup] 2025-12-04T11:11:29.6587125Z [!ALERT!] Swap in detected! [!ALERT!] 2025-12-04T11:11:40.7263215Z [!ALERT!] Swap out detected [!ALERT!] 
2025-12-04T11:11:59.7584592Z Cleaning up orphan processes